Using cwl-airflow

Configuration

Before using cwl-airflow, it should be initialized with the default configuration by running the command

$ cwl-airflow init

Optional parameters:

Flag   Description                                                           Default
-l     number of processed jobs kept in history, int                         10 x Running, 10 x Success, 10 x Failed
-j     path to the folder where all the new jobs will be added, str          ~/airflow/jobs
-t     timeout for importing all the DAGs from the DAG folder, sec           30
-r     webserver workers refresh interval, sec                               30
-w     number of webserver workers to be refreshed at the same time, int     1
-p     number of threads for Airflow Scheduler, int                          2

Consider using -r 5 -w 4 to make Airflow Webserver react faster to newly created DAGs.
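For example, a combined initialization might look like the following (the flag values are illustrative and should be adjusted to your setup):

$ cwl-airflow init -r 5 -w 4 -p 4 -j ~/airflow/jobs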

If you update the Airflow configuration file manually (default location is ~/airflow/airflow.cfg), make sure to run the cwl-airflow init command to apply all the changes, especially if the core/dags_folder or cwl/jobs parameters have been changed.

Submitting new job

To submit a new CWL descriptor and Job file to cwl-airflow, run the following command

cwl-airflow submit WORKFLOW JOB

Optional parameters:

Flag   Description                                                                                              Default
-o     path to the folder where all the output files should be moved after successful workflow execution, str   current directory
-t     path to the temporary folder for storing intermediate results, str                                       /tmp/cwlairflow
-u     ID for DAG's unique identifier generation, str                                                            random uuid
-r     run submitted workflow with Airflow Scheduler, bool                                                       False
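For example, to submit a workflow, redirect its results to a dedicated folder, and run it immediately with Airflow Scheduler, a call might look like this (the paths and the ID are illustrative):

$ cwl-airflow submit -o ~/results -t /tmp/my_run -u my_run_001 -r WORKFLOW JOB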

The arguments -o, -t and -u do not override the values already set in the Job file fields output_folder, tmp_folder and uid, respectively. The meaning of these fields is explained in the How It Works section.

The submit command resolves all relative paths from the Job file, adds the mandatory fields workflow, output_folder and uid (if not provided), and copies the Job file to the Jobs folder. The CWL descriptor file and all input files referenced in the Job file should not be moved or deleted while the workflow is running. The submit command will not execute the submitted workflow unless the -r argument is provided; otherwise, make sure that Airflow Scheduler (and, optionally, Airflow Webserver) is running. Note that the -r argument was added only to comply with the interface through which the CWL community runs its conformance tests, so it is preferable to execute the submitted workflow with Airflow Scheduler, especially if you are planning to use LocalExecutor instead of the default SequentialExecutor.
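As a reference, the same values can also be set directly in the Job file. A minimal sketch of such a Job file is shown below; the fastq_file input is hypothetical and depends on the actual workflow inputs:

# illustrative Job file (YAML); input names depend on the workflow
fastq_file:
  class: File
  path: ./inputs/sample.fastq   # relative paths are resolved by the submit command
output_folder: /data/results/run_001
tmp_folder: /tmp/run_001
uid: run_001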

Depending on your Airflow configuration, it may take some time for Airflow Scheduler and Webserver to pick up new DAGs. Consider using cwl-airflow init -r 5 -w 4 to make Airflow Webserver react faster to newly created DAGs.

To start Airflow Scheduler (don’t run it if cwl-airflow submit is used with the -r argument)

airflow scheduler

To start Airflow Webserver (by default it is accessible at localhost:8080)

airflow webserver

Please note that both Airflow Scheduler and Webserver can be adjusted through the configuration file (default location is ~/airflow/airflow.cfg). Refer to the official Airflow documentation for details.
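As an illustration, the relevant parameters mentioned above live in sections of airflow.cfg similar to the following (a minimal sketch; the paths are examples, and cwl-airflow init should be re-run after any manual edits):

[core]
dags_folder = /home/user/airflow/dags

[cwl]
jobs = /home/user/airflow/jobs

[webserver]
web_server_port = 8080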

Demo mode

  • To get the list of the available demo workflows

    $ cwl-airflow demo --list
    
  • To submit a specific demo workflow from the list (the workflow will not be run until Airflow Scheduler is started separately)

    $ cwl-airflow demo super-enhancer.cwl
    

    Depending on your Airflow configuration, it may take some time for Airflow Scheduler and Webserver to pick up new DAGs. Consider using cwl-airflow init -r 5 -w 4 to make Airflow Webserver react faster to newly created DAGs.

  • To submit all demo workflows from the list (workflows will not be run until Airflow Scheduler is started separately)

    $ cwl-airflow demo --manual
    

    Before the demo workflows are submitted, the Jobs folder is automatically cleaned.

  • To execute all available demo workflows (automatically starts Airflow Scheduler and Airflow Webserver)

    $ cwl-airflow demo --auto
    

Optional parameters:

Flag   Description                                                                                              Default
-o     path to the folder where all the output files should be moved after successful workflow execution, str   current directory
-t     path to the temporary folder for storing intermediate results, str                                       /tmp/cwlairflow
-u     ID for DAG's unique identifier generation, str; ignored when --list or --auto is used                     random uuid
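For example, to submit a single demo workflow and collect its results into a dedicated folder (the path is illustrative):

$ cwl-airflow demo -o ~/demo_results super-enhancer.cwl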

Running sample ChIP-Seq-SE workflow

This ChIP-Seq-SE workflow is a CWL version of a Python pipeline from BioWardrobe. It starts by extracting the input FASTQ file (if it was compressed). The next step runs Bowtie to align reads to a reference genome, producing an unsorted SAM file. The SAM file is then sorted and indexed with Samtools to obtain a BAM file and a BAI index. Next, MACS2 is used to call peaks and to estimate fragment size. In the last few steps, the coverage by estimated fragments is calculated from the BAM file and reported in bigWig format. The pipeline also reports statistics, such as read quality, peak number and base frequency, as well as other troubleshooting information, using tools such as Fastx-toolkit and Bamtools.

To get the sample workflow with input data

$ git clone --recursive https://github.com/Barski-lab/ga4gh_challenge.git --branch v0.0.5
$ ./ga4gh_challenge/data/prepare_inputs.sh

Please be patient; it may take some time to clone the submodule with the input data. Running the prepare_inputs.sh script will uncompress the input FASTQ file.

To submit the workflow for execution

cwl-airflow submit ga4gh_challenge/biowardrobe_chipseq_se.cwl ga4gh_challenge/biowardrobe_chipseq_se.yaml

To start Airflow Scheduler (don’t run it if cwl-airflow submit is used with the -r argument)

airflow scheduler

To start the Airflow web interface (by default it is accessible at localhost:8080)

airflow webserver

Pipeline was tested with

  • macOS 10.13.6 (High Sierra)
  • Docker
    • Engine: 18.06.0-ce
    • Machine: 0.15.0
    • Preferences
      • CPUs: 4
      • Memory: 2.0 GiB
      • Swap: 1.0 GiB
  • Elapsed time: 23 min (may vary depending on your Internet connection bandwidth, especially when the pipeline is run for the first time and all Docker images are being fetched from DockerHub)