Welcome to DICAST!

The DICAST pipeline was initially designed to benchmark alternative splicing (AS) tools based on simulated “ground truth” reads produced by ASimulator. Here we provide a pipeline for running several mapping and AS event detection tools and evaluate and compare the results. DICAST however is not only suitable for simulated data, but you can also use it for real data.

We hope for this to be an open collaborative effort to represent and benchmark your tools. Please feel free to reach out to us, should your tool already be here and you have some edits to suggest. Also please reach out to us, if your Alternative splicing tool isn’t here and you’d like it to be.

To provide a fair baseline while maintaining easy usability, per default we run the tools with their default variables. If you feel like this is not doing your tool justice please contact us. The default parameters can be changed by editing the ENTRYPOINT.sh scripts of each tool.

The tools included here are the most widely used and well maintained among AS event detection tools. If you would like to include your tool in the pipeline please let us know. We hope this collection can be a starting point for future benchmarking approaches and quality control.

_images/pipeline.png

What is DICAST?

DICAST is a collection of alternative splicing event detection tools for analyzing RNA-Seq data. DICAST runs on Snakemake pipelines and relies on Docker based containerization. For easy installation and maintenance, we provide docker containers for every integrated tool at Dockerhub

DICAST can be run as a complete pipeline, starting with simulating RNA-Seq data with ASimulator, mapping the reads to a fasta reference, get information about AS with one or multiple tools and finally visualize and compare the results from different tools with DICAST unify.

Alternatively, you can run one of the same tools as a single docker container without snakemake.

DICAST is available for download on Github.

When should I use DICAST?

  • If you want to benchmark or compare different mapping and splicing tools with a genome and an annotation.

  • If you want to analyze your data with one or more tools included in DICAST and find out which tools give you the events you’re interested in.

How do I cite DICAST?

The preprint citing DICAST is Alternative splicing analysis benchmark with DICAST, is available for review on bioRxiv

If you use DICAST please cite the preprint as:

|Fenn, A.M., Tsoy, O., Faro, T., Roessler, F., Dietrich, A., Kersting, J., Louadi, Z., Lio, C.T., Voelker, U., Baumbach, J. and Kacprowski, T., 2022. Alternative splicing analysis benchmark with DICAST. bioRxiv.

Setup

DICAST has two main dependencies, Conda and Docker. Setting it up completely however, needs you to build docker images, which takes around 2.5 hours, locally, so it is recommended to get a bit of a headstart. It is faster to just pull them. Either way, should you run DICAST once, the images are cached on your computer and is accessible for later use.

Please follow these steps carefully to set up your working environment.

Clone Git Repository

If you don’t have git, please install git with the following Install Git.

Warning

Where you choose to install this git may be limited by “where can docker mount”.

Directories that can be mounted as Docker’s mounted volumes have permission based limitations. If you don’t feel confident about this section, please talk to your administrator.

Docker permission limitation:

Docker’s mounted volumes must hold the permissions: drwxrwxr-x. If you’re on a linux file system this means that all parent folders of your working directory must be more permissive than drwxrwsr-x, except “/”.

This can be ensured with the command for each of the parent directory until ‘/’.

chmod a+rX,u+w,g+w <directory hosting the DICAST git>

If you have sudo, consider /opt/DICAST/ as your working directory:

Cloning DICAST’s git repository:

Clone our project repository to a directory of your choice. This directory will be considered the working directory for most of the commands listed in this documentation.

git clone https://github.com/CGAT-Group/DICAST.git

This will give you access to the necessary scripts and the directory structure hosts an example for the inputs in a directory called “sample_input”. The directory structure within this git is assumed by DICAST, so please don’t modify directory names within this working directory.

Directory Structure

_images/dicast_input_folder_structure.png

Note

Our pipeline allows to run many different tools in the same way. The scripts therefore rely on the directory structure specified here. Please don’t rename any directories that are listed here within the git. An output directory is created with your first run. This directory may be renamed.

Example Tree Structure

This is an example for the tree structure when running the pipeline for alternative splicing. Please note that you only need a .fa and a .gtf file if you start your analysis with ASimulator, since it will create .fastq files for you. However, some tools need specific input files. Please refer to the respective tool documentation for further information.

input
├── casedir
│   ├── bamdir
│   └── fastqdir
├── controldir
│   ├── bamdir
│      ├── example_bbmap.sam
│      ├── example_contextmap.sam
│      ├── example_crac.sam
│      ├── example_dart.sam
│      ├── example_gsnap.sam
│      ├── example_hisat.sam
│      ├── example_mapsplice.sam
│      ├── example_minimap.sam
│      ├── example_segemehl.sam
│      ├── example_starAligned.out.sam
│      └── example_subjunc.sam
│   └── fastqdir
│       ├── example_1.fastq
│       └── example_2.fastq
├── bowtie_fastadir
│   ├── 10.fa
│   ├── 11.fa
│   ├── 12.fa
│   ├── 13.fa
│   ├── 14.fa
│   ├── 15.fa
│   ├── 16.fa
│   ├── 17.fa
│   ├── 18.fa
│   ├── 19.fa
│   ├── 1.fa
│   ├── 20.fa
│   ├── 21.fa
│   ├── 22.fa
│   ├── 2.fa
│   ├── 3.fa
│   ├── 4.fa
│   ├── 5.fa
│   ├── 6.fa
│   ├── 7.fa
│   ├── 8.fa
│   ├── 9.fa
│   ├── MT.fa
│   ├── X.fa
│   └── Y.fa
├── example.fa
├── example.gff
└── example.gtf

Input files

The sample_input is a template of what the files should look like. Let’s however compare this with a real world example. DICAST is not intended to be limited to any specific organism, but for examples, we go with the assembly you can download at NCBI homo sapiens.

example.fa : refers to a reference genome .fna or fa.

example.gff: refers to a reference annotation .gff or .gff3.

example.gtf: refers to a reference annotation .gtf.

Warning

All files, including references and fastq files must be unzipped, as most of the tools within dicast require them in unzipped form.

Note

If you work with the human genome or would like to just test if DICAST is installed well, check out the script at initializing-dicast.sh and execute it with the command bash initializing-dicast.sh, to populate your input directory with relavent human references.

Note

casedir, is currently unsupported. DICAST was built originally with a design that included tools for differential analysis. It maintains the directory structure in order to expand, to cover differential tools in the future.

example_*.sam: DICAST can map your fastq files for you with a mapper of your choice, the results of such mapping will be found here in this directory bamdir. If you have already mapped bam/sam files, place them in the input/controldir/bamdir, for DICAST to start from here.

Setup docker

1. Get docker-compose for your system (sudo required)

This requires you to have administrative rights on your computer, or talk to your system administrator about getting docker & docker-compose on your system and giving you rights to use the docker user group.

A. Download Docker

Follow the Docker Engine installation manual for getting Docker, your first step. : https://docs.docker.com/engine/install/

B. Post-install Docker steps

To run DICAST in user mode entirely, please fulfill this post-install step to run docker as a non-root user.

C. Install docker-compose

This while closely related, docker-compose is the last of DICAST’s docker dependencies. Follow the installation manual from docker-compose. https://docs.docker.com/compose/install/#install-compose

Basic Docker commands

Should you have docker configured on your system, you shouldn’t run into a permission error for the following commands.

docker images

to list docker images in on your computer.

docker ps

to list all running containers only.

docker ps -a

to list all running and stopped containers.

docker --version

We support Docker version 19 and above.

docker-compose --version

2. Pull docker images (Not needed with snakemake)

Should the tool you intended to run, not build locally, it’s also possible to pull them from DICAST’s dockerhub repository at: https://hub.docker.com/repository/docker/dicastproj/dicast

docker pull dicastproj/dicast:tagname

3. Build docker images (For Developers)

While the steps described in this section are handled by DICAST’s graphical interface, it can also be accessed via command line, for more control.

Build all images

If you intend to use multiple dockers at once you can use our snakemake pipeline, which will take care of building the docker images. If you want to build the dockers manually, we provide a docker-compose.yml file which will let you build them yourself. You can use the command the following command to build all images.

docker-compose -f scripts/Snakemake/docker-compose.yml build

If you’d like to edit DICAST’s docker-compose file, see the docker-compose Manual.

Build one image

If you only want to build one specific docker image, run the following command to first build some core essential containers:

docker-compose -f scripts/Snakemake/docker-compose.yml build base conda bowtie star

And if you want to build any of the other tools, use the following command:

docker-compose -f scripts/Snakemake/docker-compose.yml build <tool>

Where <tool> needs to be replaced with one or more of the following tools:

bbmap, contextmap, crac, dart, gsnap, hisat, mapsplice, minimap, segemehl, star, subjunc, asgal, aspli, eventpointer, irfinder, majiq, sgseq, spladder, whippet

4. Other helpful commands

To gracefully stop a running docker container (If perhaps snakemake’s process had to be killed):

docker stop <docker-container-name/ID>

Remove an image (to save space, after your analysis):

docker rmi -f <image id>

Install Snakemake

snakemake is the pipe-lining software that enables your DICAST runs. You can set up snakemake in a conda environment.

If you have never worked with conda before you might want to get conda first: https://conda.io/projects/conda/en/latest/user-guide/install/index.html

# create conda environment from .yml file with snakemake in it.
conda env create -f scripts/snakemake/dicast-snakemake.yml

# if you want to use DICAST, activate the "dicast-snakemake" environment
conda activate dicast-snakemake

If you want to learn more about snakemake, you can check out the snakemake documentation: snakemake.

Configuring DICAST

Before running DICAST, please take some time to configure it for your first run.

DICAST is best run with the GUI, which automates the configuration of scripts/snakemake/snakemake_config.yaml, scripts/config.sh & scripts/asevent_config.sh, for a quick run of DICAST without simulated data, on your experiments.

DICAST can be run via CLI, however, this feature is currently in development.

If you’d like to modify the Simulated dataset, please modify scripts/ASimulatoR_config.R (See ASimulatoR Parameters)

If you’d like to run DICAST with just one tool via docker, then the files you need to modify are: scripts/config.sh, scripts/asevent_config.sh

It’s recommended to take a closer look at the config files on disk before your first run.

The following files are all the configuration files found in DICAST:
scripts/snakemake/snakemake_config.yaml,
scripts/ASimulatoR_config.R,
scripts/config.sh &
scripts/asevent_config.sh.

Read the full reference here

Snakemake parameters

NEEDS edit Found in file: scripts/snakemake/snakemake_config.yaml

These parameters are either set in the GUI, or if you’re running DICAST via cli, these parameters determine properties your DICAST run.

Possible_overwrite_acknowledge:
When running DICAST first, an output directory is created. When your run is finished, please rename the output directory, if you want to save this output; so that you don’t overwrite outputs with the second run of DICAST.
ASimulatoR files such as src/ASimulatoR/out/event_annotation.tsv are also overwritten between runs, if ASimulatoR is run again.
true or false
true: DICAST runs uninterrupted.
false: DICAST run is interrupted until true
ASimulatoR:
do: True / False
Run ASimulatoR with the configs as stored in file: :scripts/snakemake/snakemake_config
Mapping_tools:
What_tools_to_run: ‘<insert name of mapping tools to run, separated by spaces>’
pick one of the following bbmap contextmap crac dart gsnap hisat mapsplice minimap segemehl star subjunc
Example: to run all tools: ‘bbmap contextmap crac dart gsnap hisat mapsplice minimap segemehl star subjunc’
Example: to some two tools: ‘minimap star’
Example: to run one tool: ‘star’
Alternative_splicing_detection_tools:
What_tools_to_run: ‘<insert name of Alternative Splicing tools to run, separated by spaces>’
pick one of the following asgal aspli eventpointer irfinder majiq sgseq spladder whippet
Example: to run all tools: ‘asgal aspli eventpointer irfinder majiq sgseq spladder whippet’
Example: to some two tools: ‘eventpointer whippet’
Example: to run one tool: ‘whippet’

ASimulatoR parameters

Found in file: scripts/ASimulatoR_config.R

Parameters are also explained in the following git github/biomedbigdata/ASimulatoR

ncores
Number of cores used by ASimulatoR
Within dicast, the max number of cores supplied to Snakemake, is as much is used within DICAST’s pipeline. ( See Snakemake parameters)
multi_events_per_exon
T or F
Should each exon be treated as a target for only one Alternative Splicing event or would you like to see events like Multiple Exon Skipping events, Alternative Last/First Exon + Exon Skipping events?
probs_as_freq
T or F
Default:
F: if probs_as_freq was FALSE, a random number would be drawn for each event-superset combination and only if it was smaller than 1/9 the AS event would be created
T: The exon supersets are partitioned corresponding to the event_prob parameter.
error_rate
Default: 0.001
In the uniform error model, probability that the sequencer records the wrong nucleotide at any given base.
readlen
Read Length
Default is 76
max_genes
define the number of genes you want to work with. If you want all exons, do not specify this parameter or set it to NULL
seq_depth
Sequencing depth of simulated experiment
num_reps
define, how many groups and samples per group you analyze. Here we create a small experiment with two groups with one sample per group:
as_events
make a list in R with the following set or a subset of the following:
c(‘es’, ‘mes’, ‘ir’, ‘a3’, ‘a5’, ‘afe’, ‘ale’, ‘mee’)
as_combs
Combinations of AS events desired in the simulated dataset.
event_probs
Event probabilities of AS events within the simulated dataset

Tool core parameters

Found in file: scripts/config.sh

Note

If a parameter is recommended as a default. It’s for the snakemake workflow to work smooth. Parameters with this value will be marked with: recommended to leave at default

Warning

If a parameter exists in the config files but isn’t listed in this reference, please don’t change the default on this paramenter.

Basic parameters
ncores
Number of cores or threads that each tool will use. Note when using a snakemake pipeline: the resulting number of cores used is a result of multiplication of ncores and snakemake -j parameter.
Default: 16
workdir recommended to leave at default
Name of the base directory inside the Docker.
Default: /MOUNT
outdir recommended to leave at default
Name of the output directory; should be named after the specific tool that was used (use the $tool variable for that).
Default: $workdir/output/${tool:-unspecific}-output
read_length
Length of the reads inside the fastq files.
Default: 76
Input Directories
inputdir recommended to leave at default
Base input directory.
Default: $workdir/input
controlfolder recommended to leave at default
Directory for all needed input files when no differential comparison. Directory for control sample input files when running differential AS event detection.
Default: $inputdir/controldir
casefolder recommended to leave at default
Directory only for case sample input files in case of differential AS event detection.
Default: $inputdir/casedir
fastqdir recommended to leave at default
Directory for fastq files.
Currently same as ‘controlfastq’
Default: $controlfolder/fastqdir
bamdir recommended to leave at default
Directory for bam files.
Currently same as ‘controlbam’
Default: $controlfolder/bamdir
samdir recommended to leave at default
Directory for sam files.
Default: $controlfolder/bamdir
fastadir
Directory for the reference genome file
Default: $inputdir
gtfdir
Directory for the annotation file file.
Default: $inputdir
gffdir
Directory for gff file
Default: $inputdir
Tool specific parameters
bowtie_fastadir
Some tools require chromosome-wise fasta-inputs
Default: $inputdir/fasta_chromosomes/
Index Parameters
recompute_index
Force recompute the index even if the index with $indexname already exists.
Default: false
indexname
Basename of the index (without eg. .1.bt2 for bowtie index).
Default: ${fastaname}_index
star_index
Folder containing a star index built with the $gtf and $fasta files (see below), used by: IRFinder, KisSplice, rMATS
Default: $workdir/index/star_index
indexdir
Directory of the index.
Default: $workdir/index/${tool:-unspecific}_index
ASimulatoR Parameters
asimulator_gtf
Name of the file in the input directory used by ASimulatoR to generate new transcripts
Example: Homo_sapiens.GRCh38.105.gtf
Input Parameters
fastaname
Name of the genome reference file (fasta format) inside $fastadir.
Example: Homo_sapiens.GRCh38.dna.primary_assembly.fa
gtfname
Name of annotation reference file inside $gffdir.
Example: splicing_variants.gtf
gffname
Name of gff reference file inside $gffdir.
Example: splicing_variants.gff3

Note

There should be no need to edit fasta, gtf and gff since they just combine other parameters.

fasta
Full path to the reference genome file.
Default: ${fastadir:-unspecific}/$fastaname
gtf
Full path to the annotation file.
Default: ${gtfdir:-unspecific}/$gtfname
gff
Full path to the gff file.
Default: ${gffdir:-unspecific}/$gffname
Basic Mapping Parameters
outname recommended to leave at default
Base name of the output files. They will usually be prefixed with the fastq file name and suffixed with .sam.
Default: $tool (the name of the tool creating the ouput files)

Warning

Something broke while changing the config file? Make sure there is no space between the variable, the equal sign and the value.

Since these files are bash scripts, it is important to mind the syntax rules. E.g., there can’t be a whitespace before and after “=”.

For example: | Wrong: workdir = “dockers/” | Right: workdir=”dockers/”

Alternative splicing tools parameters

Found in file: scripts/asevent_config.sh

This config file sets parameters that are specific to AS event detection tools only.

Basic Parameters
transcript
Fasta file for gene transcripts.
Default: $fasta (set in config.sh, see the General Parameters)
star_alignment_files
Path to the folder containing star alignment files (*.SJ.)
Default: $workdir/output/star-output

Note

We support only paired RNA-Seq - fastq files have to be in pairs.
Set the suffixes parameters (including the file extension) for all fastq pairs (e.g. _1.fastq and _2.fastq).
fastqpair1suffix
Suffix for the first file of the fastq pair.
Example: _1.fastq
fastqpair2suffix
Suffix for the second file of the fastq pair.
Example: _2.fastq
use_bam_input_files
Determines what kind of input to use: 1 for bam files, 0 for fastq files.
Default: 0
combine_events
Events such as Multiple Exon Skipping should be represented as such, instead of individual exon skipping events.
Default: 1

Warning

Something broke while changing the config file? Make sure there is no space between the variable, the equal sign and the value.

Since these files are bash scripts, it is important to mind the syntax rules. E.g., there can’t be a whitespace before and after “=”.

For example: | Wrong: workdir = “dockers/” | Right: workdir=”dockers/”

Most frequent changes to configurations:

Note

The following explanations assume that you use our directory structure as described in Directory Structure.

1. scripts/snakemake/snakemake_config.yaml

The following are the snakemake parameters that you’re most likely to change for a CLI run:.

Possible_overwrite_acknowledge:
do: false
change to true. This is set to false after every run to prevent overwriting of output files
Mapping_tools:
What_tools_to_run: ‘<insert name of mapping tools to run, separated by spaces>’
pick one of the following bbmap contextmap crac dart gsnap hisat mapsplice minimap segemehl star subjunc
Example: to some two tools: ‘minimap star’
Example: to run all tools: ‘bbmap contextmap crac dart gsnap hisat mapsplice minimap segemehl star subjunc’
Example: to run one tool: ‘star’
Alternative_splicing_detection_tools:
What_tools_to_run: ‘<insert name of Alternative Splicing tools to run, separated by spaces>’
pick one of the following asgal aspli eventpointer irfinder majiq sgseq spladder whippet
Example: to run all tools: ‘asgal aspli eventpointer irfinder majiq sgseq spladder whippet’
Example: to some two tools: ‘eventpointer whippet’
Example: to run one tool: ‘whippet’

2. scripts/config.sh

The following are basic parameters that you are most likely to change on the GUI and in the file scipts/config.sh.

Warning

Since these files are bash scripts, it is important to mind the syntax rules. E.g., there can’t be a whitespace before and after “=”.

Basic Parameters

ncores
if you want to use more cores for each tool within snakemake. (not the same as total cores available for snakemake -j 2)

Input Parameters

asimulator_gtf
the genome gtf annotation that you use to simulate the data. Default: ‘Homo_sapiens.GRCh38.105.gtf’.
fastaname
the genome reference file. Default: ‘Homo_sapiens.GRCh38.dna.primary_assembly.fa’.
gtfname
the genome gtf annotation that you use for mapping and alternative splicing analysis. If you’re using ASimulatoR, leave this as ASimulatoR.gtf.
gffname
the genome gff3 annotation that you use for mapping and alternative splicing analysis. If you’re using ASimulatoR, leave this as ASimulatoR.gff3.

The reference genome, annotation file and gff3 files could be downloaded from Ensembl.

Run your analysis

Note

This guide page assumes that you have followed all four pages of the setup carefully and that you have configred DICAST.

Run DICAST via Command Line Interface

In this section we will explain how to use DICAST to run a whole pipeline on the terminal alone.

Make sure you followed the steps described in the setup section carefully.

Change config.sh according to your run (see How to change your config.sh file)

Before getting started make sure to activate the snakemake conda environment:

conda activate dicast-snakemake

Note

Snakemake is set up to run all tools in parallel, meaning if the pipeline is run unrestricted, it will use all available cores. Be sure to set the number of cores to limit the resources available to Snakemake (explained below).

To run the snakemake pipeline:
  • Go to /path/to/DICAST/scripts/snakemake

  • Edit snakemake_config.yaml to list the tools you want to run in the corresponding lines.

  • Use a snakemake command. For example:

See snakemake -h
E.g.:
snakemake -j 2 -d /opt/DICAST/ -s /path/to/DICAST/scripts/snakemake/Snakefile-cli --configfile /path/to/DICAST/scripts/snakemake/snakemake-cli_config.yaml

Important arguments:

Argument

Explanation

-j, --cores <n>

Set the number of cores used by the pipeline.

-s, --snakefile <file>

Set the path to the Snakemake file.

-d, --directory <path>

Set the path to the working directory (containing the input, scripts folder etc.)

--configfile <file>

Set the path to the configuration file which specifies what part of the pipeline to run.

For more information, see the Snakemake documentation.

Warning

This feature has been depreciated and needs another Snakefile, without the Possible_overwrite_acknowledge rule. We keep this documentation here, for future support and to show you how the tool works under the hood.

Troubleshooting

  • Check log files under output/<tool>-output/logs/

  • If a run was canceled/exited unexpectetly and the directory is still locked, try running the same snakemake command again with --unlock or remove the files from working_directory/.snakemake/locks/

  • Snakemake itself creates some logging files, they can be found in working_directory/.snakemake/log/

Run one specific tool via Docker

Change config.sh according to your run (see How to change your config.sh file)

If you have already built the image with <tool>:<tag> (see the docker setup) you can run the following command to run the image and start the tool:

docker run -v <your mounted folder>:/MOUNT --user $(id -u):$(id -g) <tool>:<tag>

# Examples:
# If you are using our directory structure for your input and are in the dockers directory:
# Add the --rm flag to remove container, after run.

docker run -v $(pwd):/MOUNT --user $(id -u):$(id -g) --rm gsnap:0.1

Troubleshooting

  • Check Snakemake Output to see which rule failed.

  • if the rule that failed was named after a tool, check log files under output/<tool>-output/logs/ to see where the error was.

Run DICAST with a graphical user interface (depreciated)

Make sure you are in the working directory and that listing your directory looks like the directory structure mentioned in directory structure. To run dicast, activate the dicast-snakemake conda environment:

conda activate dicast-snakemake

Your prompt should show you (dicast-snakemake), to show you your conda environment. If so, start DICAST with the following command.

python gui/dicast.py
_images/Screenshot_dicast.png

Warning

DICAST is set to Run ASimulatoR with default values, however, should you wish to tune simulated dataset to your investigative questions, please modify the file scripts/ASimulatoR_config.R ( See ASimulatoR Parameters )

Note

The graphical user interface assumes a X11 rendering system. If you’re using ssh, please use the ssh -X flag, to allow X11 forwarding. If you’re not on a Linux machine, locally, find out how to host an X11 daemon for yourself. If you’re on a Mac, this could mean installing Xquartz on your localmachine. If you’re running DICAST on your local linux machine, the output of echo $DISPLAY should read :0; this suggests that you have X11 forwarded correctly to your localmachine.

Options

Explanation

Select working directory

The working directory is where you’ve hosted your DICAST git and it looks like the described directory structure needed to run DICAST. This acts as the root directory for the project.

Select custom Snakefile

By default, DICAST stores it’s Snakefile under scripts/snakemake/Snakefile. Moving the snakefile is not advised.

Possible overwrite acknowledgement

DICAST writes all outputs to output directory, generated by DICAST. If such a directory exists, maybe rename it, before starting a new run, or you’ll loose data. ASimulatoR files such as src/ASimulatoR/out/event_annotation.tsv are also overwritten between runs.

Number of cores available to Snakemake

Total number of cores given to DICAST. Minimum advised: 2

Do you want to run ASimulatoR?

ASimulatoR comes as a part of DICAST. ASimulatoR can be configured by modifying file at src/ASimulatoR/runASimulatoR.R

Referesh status

Is useful, if you connect to DICAST and a previous run is still continuing. DICAST catches previous run and updates your progress.

Abort

Interrupt a running DICAST execution.

Clean up

Cleans up intermediate snakemake files from incomplete runs, use after Abort.

Which Mapping/AS tools:

Select the tools you want to run within DICAST and click okay.

OK

Starts a DICAST run.

Close

Shuts down the GUI, leaving a running session of DICAST, you can connect to next time you start the gui.

DICAST will continue to run, and you can safely close the GUI. Re-opening the GUI connects you back to a running instance of DICAST, if it isn’t finished already.

Custom Configuration file:

With this, custom configure your dicast run

_images/Screenshot_dicast_config.png

Options

Explanation

Number of cored for each tools

Usually seen as –ncores for many tools. gives the option of efficient parallelization when possible.

Read length

An option seen by quite a few tools. supported length <200. Not every tool has been tested with varying read lengths.

Fasta name

Browse to select your reference genome

GTF name

Browse to select your reference annotation, if you’re using a real dataset. If you’re using ASimulatoR, leave this as ASimulatoR.gtf

GFF name

Browse to select your reference annotation, if you’re using a real dataset. If you’re using ASimulatoR, leave this as ASimulatoR.gff3

First fastq pair suffix

How do you differentiate paired end reads? how does each fastq file end? “1.fastq”?

Second fastq pair suffix

How do you differentiate paired end reads? how does each fastq file end? “2.fastq”?

Base output directory

Please leave this at default

Use BAM files as inputs

Some tools give you the option of starting from fastq files or from mapped files.

Should different AS events be combined into one

Multiple exon skipping, could be considered one or many AS events. How do you prefer reporting them?

Note

the close button warns you about losing changes even if you saved them. This is a bug, and will be corrected soon.

Troubleshooting

  • Check Snakemake Output to see which rule failed.

  • if the rule that failed was named after a tool, check log files under output/<tool>-output/logs/ to see where the error was.

Warning

Aborting a run: Once the dockers begin, they’re not under DICAST’s control to abort. If you really want to interrupt DICAST, also check for running containers docker ps and stop/ kill running containers with docker stop <container-name>. Also use the clean up function to clean up an interrupted run.

Interrupting a DICAST run

If you want to interrupt a DICAST run. Click on the Abort button and then click on the Clean up button. DICAST unfortunately doesn’t show you that this is a required step, so until Clean up is clicked upon, your next run will not start. Your configurations should stay as you set them last. Click on Acknowledge overwrite checkbox and you’re all set for the next run.

DICAST Outputs

DICAST provides outputs as you would expect them from each Alternative Splicing tool within output/<astoolname>-output/<Fastq-filename>_output.

DICAST also provides a output/<astoolname>-output/<Fastq-filename>_output_dicast_unified output format for each tool. This is a simple tsv file that hosts all the events found from each tool. We used this to unify the outputs needed to build each of the plots outputted by DICAST.

output/
├── <astoolname>-output
   ├── logs
   ├── <Fastq-filename>_output
   ├── <Fastq-filename>_output_<astoolname>_dicast_unified
└── plots
    └── <Fastq-filename>
        ├── <mapping tool>-name
           ├── A3_compare.png
           ├── A5_compare.png
           ├── AFE_compare.png
           ├── ALE_compare.png
           ├── ES_compare.png
           ├── IR_compare.png
           ├── MEE_compare.png
           ├── MES_compare.png
           └── overall_compare.png
        └── unmapped
            ├── A3_compare.png
            ├── A5_compare.png
            ├── AFE_compare.png
            ├── ALE_compare.png
            ├── ES_compare.png
            ├── IR_compare.png
            ├── MEE_compare.png
            ├── MES_compare.png
            └── overall_compare.png

DICAST also outputs an UpSet plot for each Fastq-filename-mapping_tool combination.

_images/upset_plot.png

This plot shows the events that were found in common by tools and shows you which tools found these events as well.


When run with ASimulatoR, DICAST also outputs precision and recall plots for each Fastq-filename-mapping_tool combination.;

for all events

_images/overall_compare.png

and for each event

_images/ES_compare.png

Workflow

DICAST pipeline. (1) Simulation (2) Mapping (3) Splicing (4) Unification

To run the entire pipeline, you need a reference genome file and a annotation file file of your organism. If you do not want to work with simulated data, you can enter each step with your own data. E.g. you can enter step 2 or 3A with fastq files from your own experiment or step 3B with bam files from your own mapping tool. Please note that not all mapping and splicing detection tools are compatible with each other and have different file requirements (e.g. reference genome, annotation file, gff). For further information, please refer to the tool-specific DICAST documentation.

Warning

We tried our best to unify the input that is required for all tools. This did not work for all tools. When a tool requires custom input you will see a warning like this on the concerning documentation page.

General Information

Not all mapping and splicing tools are compatible with each other. Please refer to the table below to see which tools you can use together successfully. fastq only tools do only work with fastq files and not with bam files and therefore don’t depend on a mapping tool.

Compatibility

asgal

aspli

eventpointer

irfinder

majiq

sgseq

spladder

whippet

bbmap

fastq only

Yes

No

Yes

No

No

Yes

Yes

contextmap

fastq only

Yes

Yes

Yes

No

No

Yes

Yes

crac

fastq only

Yes

Yes

Yes

Yes

Yes

Yes

Yes

dart

fastq only

Yes

No

Yes

Yes

No

Yes

Yes

gsnap

fastq only

No

No

Yes

No

No

Yes

Yes

hisat

fastq only

Yes

Yes

Yes

Yes

Yes

Yes

Yes

mapsplice

fastq only

Yes

Yes

Yes

Yes

Yes

Yes

Yes

minimap

fastq only

No

No

Yes

Yes

No

Yes

Yes

segemehl

fastq only

Yes

Yes

Yes

Yes

Yes

Yes

Yes

star

fastq only

Yes

Yes

Yes

Yes

Yes

Yes

Yes

subjunc

fastq only

Yes

Yes

Yes

Yes

Yes

Yes

Yes

Mapping tools

Note

Most mapping tools need an index file of the reference genome for mapping. The computation of these index files can take a long time. Our mapping tool scripts check if there already is an index for the respective tool and only build it, if it is not found. If you face any index related errors, either set the parameter $recompute_index=True or delete the old index to recalculate it.

Mapping Input Files

Tip

The paths assume you are using our suggested input structure. Example input files you can find in our examples section.

You can find the required input files in the tool-specific documentation.

fastq

Fastq files for paired end mapping. The directories are separated in controldir and casedir. The controldir is the default folder for all analyses. The casedir is only used for differential splicing analysis.

input/fastq/controldir/*yourFastqFile1*_1.fastq
input/fastq/controldir/*yourFastqFile1*_2.fastq
input/fastq/controldir/*yourFastqFile2*_1.fastq
input/fastq/controldir/*yourFastqFile2*_2.fastq
. . .
fasta

The fasta reference for your organism. Mapping tools usually only need it for indexing (see tool specific documentation).

input/*yourFastaFile*.fa
gtf

annotation reference file.

input/*yourGTFfile*.gtf
bowtie_fastadir

Only needed by some tools. Chromosome-wise fasta files for your organism to build an index with bowtie.

input/bowtie_fastadir/1.fa
input/bowtie_fastadir/2.fa
input/bowtie_fastadir/3.fa
input/bowtie_fastadir/4.fa
. . .
input/bowtie_fastadir/X.fa
input/bowtie_fastadir/Y.fa
Optional: Index

Tool specific index file(s). If no index file is found in the index folder it will be built the first time you run the tool. This might take some time. If you want to provide your own index please make sure it is in the correct format and file names. Since the index is usually built based on the fasta reference we recommend to name the index based on the fasta reference (default). You can change the indexname variable in the config script.

index/*toolname*-index/*yourIndexBaseName*

Parameters

To provide a fair baseline while maintaining easy usability, per default we run the tools with their default variables. The default parameters can be changed by editing the ENTRYPOINT.sh scripts of each tool. The variables used by mapping ENTRYPOINT.sh scripts can be set in the config.sh and mapping_config.sh files in the scripts folder. For a usual analysis you should not need to change these parameters.

BBMap

BBMap uses a multi kmer seed and extend strategy for read mapping.

Indexing

Note

Indexing might take some time but only has to be run once per fasta file. Make sure to reuse already computed indices if possible.

DICAST will check if $indexdir/$indexname exists. If there is no index it will be automatically built. If you want to rebuild the index anyway set $recompute_index=true in scripts/mapping_config.sh. If you want to use your own precomputed index file copy it to index/bbmap-index/ and make sure the index is complete and named appropriately and according to the parameters set in the config files. We recommend including the name of the fasta file in the index name to avoid overwriting. Per default this is already the case and no parameter changes are needed.

Parameters

These are the default parameters set in the src/bbmap/ENTRYPOINT.sh script. If you want to change it you can do this in the ENTRYPOINT script directly. Please refer to the BBMap manual.

-in

Fastq filename of paired end read 1.

-in *yourFastqFile1_*1.fastq
-in2

Fastq filename of paired end read 2.

-in2 *yourFastqFile1_*2.fastq
-ref

Reference genome in fasta format.

-ref $fasta
-path

Base name of the index folder and files.

-path $indexdir/$indexname
-intronlen

Length of introns.

-intronlen 20
-xstag

Add sam flags to improve compatibility with alternative splicing tools.

-xstag us
-outm

The path to the mapped output file in sam format. The output will be separated into case and control folder based on the basefolder of the according fastq file.

-outm $outdir/$controlfolder/*yourFastqFile1_*bbmap.sam
-outu

The path to the unmapped output file in sam format. The output will be separated into case and control folder based on the basefolder of the according fastq file.

-outu $outdir/$controlfolder/*yourFastqFile1_*.bbmap_unmapped.sam
Known Issues
  • issue

  • another

  • another

ContextMap 2.0

Warning

Make sure to put the file jre-8u241-linux-i586.tar.gz in the same folder as the ContextMap Dockerfile (src/contextmap2/).nYou can get the file here: https://www.oracle.com/java/technologies/javase/javase8u211-later-archive-downloads.html

Indexing

Note

Indexing might take some time but only has to be run once per fasta file. Make sure to reuse already computed indices if possible.

DICAST will check if $indexdir/$indexname exists. If there is no index it will be automatically built. If you want to rebuild the index anyway set $recompute_index=true in scripts/mapping_config.sh. If you want to use your own precomputed index file copy it to index/contextmap-index/ and make sure the index is complete and named appropriately and according to the parameters set in the config files. We recommend including the name of the fasta file in the index name to avoid overwriting. Per default this is already the case and no parameter changes are needed.

Parameters

These are the default parameters set in the src/contextmap/ENTRYPOINT.sh script. If you want to change it you can do this in the ENTRYPOINT script directly. Please refer to the ContextMap 2.0 manual.

-reads

Comma separated list of file paths to reads in fastq format. One pair of fastq files for paired-end mapping

-reads *yourFastqFile1_*1.fastq,*yourFastqFile1_*2.fastq
-aligner_name

Used aligner (index tool). We use bowtie2.

-aligner_name bowtie2
-aligner_bin

Path to the used aligner. If you use our docker you will not have to wolly about it.

-aligner_bin /home/biodocker/bin/bowtie2
-indexer_bin

Path to the indexing tool of the aligner.

-indexer_bin /home/biodocker/bin/bowtie2-build
-indices

Comma separated list to your index files base names.

-indices *IndexChromosome1*,*IndexChromosomes2*,*IndexChromosome3*, . . .
-genome

Directory path with chromosome-wise fasta files.

-genome $bowtie_fastadir

CRAC

Indexing

Note

Indexing might take some time but only has to be run once per fasta file. Make sure to reuse already computed indices if possible.

DICAST will check if $indexdir/$indexname.ssa exists. If there is no index it will be automatically built. If you want to rebuild the index anyway set $recompute_index=true in scripts/mapping_config.sh. If you want to use your own precomputed index file copy it to index/crac-index/ and make sure the index is complete and named appropriately and according to the parameters set in the config files. We recommend including the name of the fasta file in the index name to avoid overwriting. Per default this is already the case and no parameter changes are needed.

Parameters

These are the default parameters set in the src/crac/ENTRYPOINT.sh script. If you want to change it you can do this in the ENTRYPOINT script directly. Please refer to the CRAC manual.

-i

Base name of the index folder and files.

-i $indexdir/$indexname
-k

Number of k-mers to be used. 22 is the recommended number for human genome.

-k 22
-r

Space separated list of file paths to reads in fastq format. One pair of fastq files for paired-end mapping

-r *yourFastqFile1_*1.fastq *yourFastqFile1_*2.fastq
-o

The path to the mapped output file in sam format. The output will be separated into case and control folder based on the basefolder of the according fastq file.

-o $outdir/$controlfolder/*yourFastqFile1_*crac.sam

—detailed-sam Return a detailed sam file as output.

--stranded

Reads are from a strand specific RNA-seq protocol.

Dart

Indexing

Note

Indexing might take some time but only has to be run once per fasta file. Make sure to reuse already computed indices if possible.

DICAST will check if $indexdir/$indexname.sa exists. If there is no index it will be automatically built. If you want to rebuild the index anyway set $recompute_index=true in scripts/mapping_config.sh. If you want to use your own precomputed index file copy it to index/dart-index/ and make sure the index is complete and named appropriately and according to the parameters set in the config files. We recommend including the name of the fasta file in the index name to avoid overwriting. Per default this is already the case and no parameter changes are needed.

Parameters

These are the default parameters set in the src/dart/ENTRYPOINT.sh script. If you want to change it you can do this in the ENTRYPOINT script directly. Please refer to the Dart manual.

-i

Base name of the index folder and files.

-i $indexdir/$indexname
-f

Fastq filename of paired end read 1.

-f *yourFastqFile1_*1.fastq
-f2

Fastq filename of paired end read 2.

-f2 *yourFastqFile1_*2.fastq
-o

The path to the mapped output file in sam format. The output will be separated into case and control folder based on the basefolder of the according fastq file.

-o $outdir/$controlfolder/*yourFastqFile1_*dart.sam

GSNAP

Indexing

Note

Indexing might take some time but only has to be run once per fasta file. Make sure to reuse already computed indices if possible.

DICAST will check if $indexdir/$indexname/$indexname.contig exists. If there is no index it will be automatically built. If you want to rebuild the index anyway set $recompute_index=true in scripts/mapping_config.sh. If you want to use your own precomputed index file copy it to index/gsnap-index/ and make sure the index is complete and named appropriately and according to the parameters set in the config files. We recommend including the name of the fasta file in the index name to avoid overwriting. Per default this is already the case and no parameter changes are needed.

Parameters

These are the default parameters set in the src/gsnap/ENTRYPOINT.sh script. If you want to change it you can do this in the ENTRYPOINT script directly. Please refer to the GSNAP manual.

--db

Base name of the index folder and files.

--db $indexdir/$indexname

—dir Base folder of the index files.

--dir $indexdir

—output-file The path to the mapped output file in sam format. The output will be separated into case and control folder based on the basefolder of the according fastq file.

--output-file $outdir/$controlfolder/*yourFastqFile1_*gsnap.sam

—format Define output format (one of sam, m8).

--format sam

—force-xs-dir Add sam flags to improve compatibility with alternative splicing tools.

--force-xs-dir us

—nthreads Number of threads to be used during the computation

--nthreads $ncores
reads

After all other options call space separated list of file paths to reads in fastq format. One pair of fastq files for paired-end reads.

*yourFastqFile1_*1.fastq *yourFastqFile1_*2.fastq

HISAT2

Indexing

Note

Indexing might take some time but only has to be run once per fasta file. Make sure to reuse already computed indices if possible.

DICAST will check if $indexdir/${indexname}.4.ht2 and $indexdir/${gtfname}_splicesites.txt exists. If there is no index it will be automatically built. If you want to rebuild the index anyway set $recompute_index=true in scripts/mapping_config.sh. If you want to use your own precomputed index file copy it to index/hisat-index/ and make sure the index is complete and named appropriately and according to the parameters set in the config files. We recommend including the name of the fasta file in the index name to avoid overwriting. Per default this is already the case and no parameter changes are needed.

Parameters

These are the default parameters set in the src/hisat/ENTRYPOINT.sh script. If you want to change it you can do this in the ENTRYPOINT script directly. Please refer to the HISAT2 manual.

--x

Base name of the index folder and files.

--x $indexdir/$indexname
-1

Fastq filename of paired end read 1.

-1 *yourFastqFile1_*1.fastq
-2

Fastq filename of paired end read 2.

-2 *yourFastqFile1_*2.fastq
-S

The path to the mapped output file in sam format. The output will be separated into case and control folder based on the basefolder of the according fastq file.

-S $outdir/$controlfolder/*yourFastqFile1_*hisat.sam

—known-splicesite-infile Provide a list of known splice sites.

--known-splicesite-infile $indexdir/$indexname/splicesites.txt
-q

Activate quiet mode so only error messages are printed.

MapSplice 2

Indexing

Note

Indexing might take some time but only has to be run once per fasta file. Make sure to reuse already computed indices if possible.

DICAST will check if $indexdir/$indexname.rev.2.ebwt exists. If there is no index it will be automatically built. If you want to rebuild the index anyway set $recompute_index=true in scripts/mapping_config.sh. If you want to use your own precomputed index file copy it to index/mapsplice-index/ and make sure the index is complete and named appropriately and according to the parameters set in the config files. We recommend including the name of the fasta file in the index name to avoid overwriting. Per default this is already the case and no parameter changes are needed.

Parameters

These are the default parameters set in the src/mapsplice/ENTRYPOINT.sh script. If you want to change it you can do this in the ENTRYPOINT script directly. Please refer to the MapSplice 2 manual.

-c

Directory path with chromosome-wise fasta files.

-c $bowtie_fastadir
-x

Base name of the index folder and files.

-x $indexdir/$indexname

—gene-gtf The path to the gene annotation file in GTF format for annotation of fusion junctions.

--gene-gtf $gtf
-o

The path to the directory for the mapped output in sam format. The output will be separated into case and control folder based on the basefolder of the according fastq file.

-o $outdir/$controlfolder/*yourFastqFile1_*mapsplice
-p

Number of threads to be used during the computation

-p $ncores
-1

Fastq filename of paired end read 1.

-1 *yourFastqFile1_*1.fastq
-2

Fastq filename of paired end read 2.

-2 *yourFastqFile1_*2.fastq

Minimap2

Indexing

Note

Indexing might take some time but only has to be run once per fasta file. Make sure to reuse already computed indices if possible.

DICAST will check if $indexdir/$indexname exists. If there is no index it will be automatically built. If you want to rebuild the index anyway set $recompute_index=true in scripts/mapping_config.sh. If you want to use your own precomputed index file copy it to index/minimap-index/ and make sure the index is complete and named appropriately and according to the parameters set in the config files. We recommend including the name of the fasta file in the index name to avoid overwriting. Per default this is already the case and no parameter changes are needed.

Parameters

These are the default parameters set in the src/minimap/ENTRYPOINT.sh script. If you want to change it you can do this in the ENTRYPOINT script directly. Please refer to the Minimap2 manual.

-a

Generate CIGAR and provide output in sam format.

-o

The path to the mapped output file in sam format. The output will be separated into case and control folder based on the basefolder of the according fastq file.

-o $outdir/$controlfolder/*yourFastqFile1_*minimap.sam
-t

Number of threads to be used during the computation

-t $ncores
index

Base name of the index folder and files.

$indexdir/$indexname
reads

After all other options call space separated list of file paths to reads in fastq format. One pair of fastq files for paired-end reads.

*yourFastqFile1_*1.fastq *yourFastqFile1_*2.fastq

segemehl

Indexing

Note

Indexing might take some time but only has to be run once per fasta file. Make sure to reuse already computed indices if possible.

DICAST will check if $indexdir/$indexname exists. If there is no index it will be automatically built. If you want to rebuild the index anyway set $recompute_index=true in scripts/mapping_config.sh. If you want to use your own precomputed index file copy it to index/segemehl-index/ and make sure the index is complete and named appropriately and according to the parameters set in the config files. We recommend including the name of the fasta file in the index name to avoid overwriting. Per default this is already the case and no parameter changes are needed.

Parameters

These are the default parameters set in the src/segemehl/ENTRYPOINT.sh script. If you want to change it you can do this in the ENTRYPOINT script directly. Please refer to the segemehl manual.

-d

Reference genome in fasta format.

-d $fasta
-q

Fastq filename of paired end read 1.

-q *yourFastqFile1_*1.fastq
-q

Fastq filename of paired end read 2.

-q *yourFastqFile1_*2.fastq
-i

Base name of the index folder and files.

-i $indexdir/$indexname

—splits Use split reads alignment

-o

The path to the mapped output file in sam format. The output will be separated into case and control folder based on the basefolder of the according fastq file.

-o $outdir/$controlfolder/*yourFastqFile1_*segemehl.sam
-t

Number of threads to be used during the computation

-t $ncores

STAR

Indexing

Note

Indexing might take some time but only has to be run once per fasta file. Make sure to reuse already computed indices if possible.

DICAST will check if $star_index/$indexname/genomeParameters.txt exists. If there is no index it will be automatically built. If you want to rebuild the index anyway set $recompute_index=true in scripts/mapping_config.sh. If you want to use your own precomputed index file copy it to index/star-index/ and make sure the index is complete and named appropriately and according to the parameters set in the config files. We recommend including the name of the fasta file in the index name to avoid overwriting. Per default this is already the case and no parameter changes are needed.

Parameters

These are the default parameters set in the src/star/ENTRYPOINT.sh script. If you want to change it you can do this in the ENTRYPOINT script directly. Please refer to the STAR manual.

--sjdbGTFfile

The path to the gene annotation file in GTF format for annotation of fusion junctions.

--sjdbGTFfile $gtf

—readFilesIn Space separated list of file paths to reads in fastq format. One pair of fastq files for paired-end mapping

--readFilesIn *yourFastqFile1_*1.fastq *yourFastqFile1_*2.fastq

—genomeDir Base name of the index folder and files.

--genomeDir $indexdir/$indexname

—outFileNamePrefix The path to the directory for the mapped output in sam format. The output will be separated into case and control folder based on the basefolder of the according fastq file.

--outFileNamePrefix $outdir/$controlfolder/*yourFastqFile1_*star

—runTreadN Number of threads to be used during the computation

--runTreadN $ncores

—twopassMode Basic 2-pass mapping, with all 1st pass junctions inserted into the genome indices on the fly

--twopassMode Basic

—outSAMstrandField Add strand derived from the intron motif.

--outSAMstrandField intronMotif

—outSAMattributes Add sam flags to improve compatibility with alternative splicing tools.

--outSAMattributes  us

Subjunc

Indexing

Note

Indexing might take some time but only has to be run once per fasta file. Make sure to reuse already computed indices if possible.

DICAST will check if $indexdir/$indexname.reads exists. If there is no index it will be automatically built. If you want to rebuild the index anyway set $recompute_index=true in scripts/mapping_config.sh. If you want to use your own precomputed index file copy it to index/subjunc-index/ and make sure the index is complete and named appropriately and according to the parameters set in the config files. We recommend including the name of the fasta file in the index name to avoid overwriting. Per default this is already the case and no parameter changes are needed.

Parameters

These are the default parameters set in the src/subjunc/ENTRYPOINT.sh script. If you want to change it you can do this in the ENTRYPOINT script directly. Please refer to the Subjunc manual.

-i

Base name of the index folder and files.

-i $indexdir/$indexname
-r

Fastq filename of paired end read 1.

-r *yourFastqFile1_*1.fastq
-R

Fastq filename of paired end read 2.

-R *yourFastqFile1_*2.fastq
-o

The path to the directory for the mapped output in sam format. The output will be separated into case and control folder based on the basefolder of the according fastq file.

-o $outdir/$controlfolder/*yourFastqFile1_*subjunc
-T

Number of threads to be used during the computation

-T $ncores

—SAMoutput Return a sam file as output.

Splicing tools

Warning

Currently only alternative splicing event detection is supported. Differential splicing tools are coming soon. The differential splicing function of tools which are able to compute both alternative and differential splicing the differential mode is still in beta.

For splicing tools we differentiate between alternative and differential splicing tools. Some tools are able to compute both. Differential splicing tools compute alternative splicing for two conditions (e.g. case and control) and the files should be separated as indicated by our input directory structure. For alternative splicing analysis “control” is the default.

Splicing Input Files

Tip

The paths assume you are using our suggested input structure. Example input files you can find in our examples section.

You can find the required input files in the tool-specific documentation.

fastq

Fastq files for pair 1 and 2 fastq files stored in $fastqdir, identified by the suffix $fastqpair1suffix and $fastqpair2suffix respectively. Not all splicing tools work with fastq files. The path variables can be found in scripts/config.sh and scripts/asevent_config.sh. For differential splicing the files need to be separated in controldir and casedir

# Fastq file paths
# Assumed variable settings:
#    $fastqdir=input/fastq   ## in config.sh
#    $fastqpair1suffix="_1.fastq"   ## in asevent_config.sh
#    $fastqpair2suffix="_2.fastq"   ## in asevent_config.sh
# Replace the text between the stars *...* with your file names

input/controldir/fastq/*yourFastqFile1*_1.fastq
input/controldir/fastq/*yourFastqFile1*_2.fastq
input/controldir/fastq/*yourFastqFile2*_1.fastq
input/controldir/fastq/*yourFastqFile2*_2.fastq
. . .
bam

Bam files created by a mapping tool of your choice. When DICAST is run as a pipeline, these will be created by the selected mapping tool(s).

input/controldir/fastq/*yourFastqFile1*_1.fastq
fasta:

The name of the reference fasta file. The path variable can be found in scripts/config.sh.

# Fasta files paths
# Replace the text between the stars *...* with your file name

input/*yourFastaFile*.fa
transcript

The name of the fasta file for gene transcripts. The path variable can be found in scripts/asevent_config.sh.

# Assumed variable settings:
#    $inputdir=input   ## in config.sh

input/*yourTranscriptFasta*.fasta
gtf

Gene annotation file in GTF format.

# Replace the text between the stars *...* with your file name

input/*yourGTFfile*.gtf
gff

Gene annotation file in GFF format.

# Replace the text between the stars *...* with your file name

input/*yourGFFfile*.gff

Parameters

To provide a fair baseline while maintaining easy usability, per default we run the tools with their default variables. The default parameters can be changed by editing the ENTRYPOINT.sh scripts of each tool. The variables used by mapping ENTRYPOINT.sh scripts can be set in the config.sh and asevent_config.sh files in the scripts folder. For a usual analysis you should not need to change these parameters.

ASGAL

Warning

ASGAL requires the variables $fastqpair1suffix and $fastqpair2suffix to be set in the scripts/asevent_config.sh file.

Parameters

These are the default parameters set in the src/asgal/ENTRYPOINT.sh script. If you want to change it you can do this in the ENTRYPOINT script directly. Please refer to the ASGAL manual.

--multi

Set multi option.

-g

Reference genome in fasta format.

-g $fasta
-a

The path to the gene annotation file in GTF format for annotation of fusion junctions.

-a $gtf
-t

Transcript file.

-t $transcript
-s

Fastq filename of paired end read 1.

-s *yourFastqFile1_*1.fastq
-s2

Fastq filename of paired end read 2.

-s2 *yourFastqFile1_*2.fastq
-o

Output directory. The output will be separated into case and control folder based on the basefolder of the according fastq file.

-o $outdir
-@

Number of threads to be used during the computation

-@ $ncores
--allevents

Report all events, not only novel ones.

Aspli

Note

Aspli can be used to calculate differential splicing as well as only alternative-splicing events. If you want to perform differential analysis set differential=1 in the /scripts/asevent_config.sh config file.Otherwise set differential=0.

Note

Aspli is an R package. Therefore our ENTRYPOINT.sh script for Aspli calls an R script to run the tool. The parameters listed here are the parameters given to the R script.

Parameters

These are the default parameters set in the src/aspli/ENTRYPOINT.sh script. If you want to change it you can do this in the ENTRYPOINT script directly. Please refer to the Aspli manual.

--gtf

The path to the gene annotation file in GTF format for annotation of fusion junctions.

--gtf $gtf
--cores

Number of threads to be used during the computation

--cores $ncores
--readLength

Length of reads.

--readLength $read_length
--out

Output directory. The output will be separated into case and control folder based on the basefolder of the according bam file. If you are running the DICAST pipeline to compare different mapping tools this will include the name of the mapping tool of the used bam file.

--out $outdir
--differential

1 to run differential analysis, 0 otherwise.

--differential $differential

EventPointer

Note

EventPointer can be used to calculate differential splicing as well as only alternative-splicing events. If you want to perform differential analysis set differential=1 in the /scripts/asevent_config.sh config file.Otherwise set differential=0.

Note

EventPointer is an R package. Therefore our ENTRYPOINT.sh script for EventPointer calls an R script to run the tool. The parameters listed here are the parameters given to the R script.

Parameters

These are the default parameters set in the src/eventpointer/ENTRYPOINT.sh script. If you want to change it you can do this in the ENTRYPOINT script directly. Please refer to the EventPointer manual.

--gtf

The path to the gene annotation file in GTF format for annotation of fusion junctions.

--gtf $gtf
--cores

Number of threads to be used during the computation

--cores $ncores
--out

Output directory. The output will be separated into case and control folder based on the basefolder of the according bam file. If you are running the DICAST pipeline to compare different mapping tools this will include the name of the mapping tool of the used bam file.

--out $outdir
--bamfolder

Location of bam files.

--bamfolder $controlfolder
--differential

1 to run differential analysis, 0 otherwise.

--differential $differential

IRFinder

Note

IRFinder can use both fastq and bam files. To use bamfiles please set the parameter $use_bam_input_files=1, and =0 to use fastq files in the as_config.sh script.

Parameters

These are the default parameters set in the src/irfinder/ENTRYPOINT.sh script. If you want to change it you can do this in the ENTRYPOINT script directly. Please refer to the IRFinder manual.

-r

Base folder of the index files.

-r $indexdir
-d

Output directory. The output will be separated into case and control folder based on the basefolder of the according fastq file.

-d $outdir
reads

After all other options call space separated list of file paths to reads in fastq format. One pair of fastq files for paired-end reads.

*yourFastqFile1_*1.fastq *yourFastqFile1_*2.fastq

MAJIQ

Note

MAJIQ can be used to calculate differential splicing as well as only alternative-splicing events. If you want to perform differential analysis set differential=1 in the /scripts/asevent_config.sh config file.Otherwise set differential=0.

Parameters

These are the default parameters set in the src/majiq/ENTRYPOINT.sh script. If you want to change it you can do this in the ENTRYPOINT script directly. Please refer to the MAJIQ manual.

build reference

The path to the gene annotation file in GFF format.

$gff
-c

MAJIQ config file (built based on DICAST config parameters in ENTRYPOINT.sh)

-c $config
-j

Number of threads to be used during the computation

-j $ncores
-o

Output directory with majiq build output.

-o $outdir/$outdir_name/build
maqiq psi

Run MAJIQ in psi mode with files built from gff as input.

-j

Number of threads to be used during the computation

-j $ncores
-o

Output directory with psi output. Used to build splicegraph with voila.

-o $outdir/$outdir_name/psi
-n

Run with bam files as input.

-n “BAM”

SGSeq

Note

SGSeq is an R package. Therefore our ENTRYPOINT.sh script for SGSeq calls an R script to run the tool. The parameters listed here are the parameters given to the R script.

Parameters

These are the default parameters set in the src/sgseq/ENTRYPOINT.sh script. If you want to change it you can do this in the ENTRYPOINT script directly. Please refer to the SGSeq manual.

--gtf

The path to the gene annotation file in GTF format for annotation of fusion junctions.

--gtf $gtf
--path_to_bam

Name of bamfile.

--path_to_bam $controlfolder/*filename*.bam
--out

Output directory. The output will be separated into case and control folder based on the basefolder of the according bam file. If you are running the DICAST pipeline to compare different mapping tools this will include the name of the mapping tool of the used bam file.

--out $outdir
--cores

Number of threads to be used during the computation

--cores $ncores

SplAdder

Note

SplAdder can be used to calculate differential splicing as well as only alternative-splicing events. If you want to perform differential analysis set differential=1 in the /scripts/asevent_config.sh config file.Otherwise set differential=0.

Parameters

These are the default parameters set in the src/spladder/ENTRYPOINT.sh script. If you want to change it you can do this in the ENTRYPOINT script directly. Please refer to the SplAdder manual.

-b

Name of bamfile.

-b $controlfolder/*filename*.bam
-o

Output directory. The output will be separated into case and control folder based on the basefolder of the according bam file. If you are running the DICAST pipeline to compare different mapping tools this will include the name of the mapping tool of the used bam file.

-o $outdir
-a

The path to the gene annotation file in GTF format for annotation of fusion junctions.

-a $gtf
--parallel

Number of threads to be used during the computation

--parallel $ncores
-n

Length of reads.

-n $read_length
--output-txt-conf

Output in txt format.

Whippet

Note

Whippet can be used to calculate differential splicing as well as only alternative-splicing events. If you want to perform differential analysis set differential=1 in the /scripts/asevent_config.sh config file.Otherwise set differential=0.

Parameters

These are the default parameters set in the src/whippet/ENTRYPOINT.sh script. If you want to change it you can do this in the ENTRYPOINT script directly. Please refer to the Whippet manual.

--fasta

Reference genome in fasta format.

--fasta $fasta
--gtf

The path to the gene annotation file in GTF format for annotation of fusion junctions.

--gtf $gtf
--bam

Name of bamfile.

--bam $controlfolder/*filename*.bam
-x

Output directory for whippet index.

-x $outdir/*bamfilename*/graph
-o

Output directory. The output will be separated into case and control folder based on the basefolder of the according bam file. If you are running the DICAST pipeline to compare different mapping tools this will include the name of the mapping tool of the used bam file.

-o $outdir

Examples

Here we provide the detailed description of possible workflows. We recommend to run analysis using a terminal multiplexer, e.g. tmux or screen.

Running multiple mapping tools (E.g., STAR, HISAT2 and bbmap)

  1. Make sure you followed the steps described in the setup section carefully.

  2. Before getting started make sure to activate the snakemake conda environment:

conda activate dicast-snakemake
  1. Create the input folder:

cd /path/to/DICAST/
mkdir input
  1. Create the directory structure as in the sample_output:

cd input
mkdir controldir
cd controldir
mkdir fastqdir
  1. Download or copy the genome fasta file into the input folder. Dont’t forget to uncompress it. E.g.:

cd /path/to/DICAST/input
wget http://ftp.ensembl.org/pub/release-105/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
  1. Download or copy the genome gtf annotation into the input folder. Dont’t forget to uncompress it. E.g.:

wget http://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/Homo_sapiens.GRCh38.105.gtf.gz
gunzip Homo_sapiens.GRCh38.105.gtf.gz
  1. Download or copy the fastq files you want to align into the /path/to/DICAST/input/controldir/fastqdir. Note: we support only paired-end RNA-Seq - fastq files have to be in pairs.

  2. Go to /path/to/DICAST/scripts and edit config.sh according to your run (see How to change your config.sh file):

cd /path/to/DICAST/scripts
nano config.sh

In the config.sh file edit the following lines:

read_length=76
fastaname=Homo_sapiens.GRCh38.dna.primary_assembly.fa
gtfname=Homo_sapiens.GRCh38.105.gtf
  1. List the mapping tools you want to run:

cd /path/to/DICAST/scripts/snakemake/
nano snakemake_config.yaml

In the snakemake_config.yaml file edit the following lines:

Mapping_tools:
    What_tools_to_run: 'star, hisat, bbmap'
  1. In the /path/to/DICAST/scripts/snakemake/ folder run:

snakemake -j 1 -d /path/to/DICAST/input -s Snakefile -c snakemake_config.yaml

This command will start the mapping tools indicated in the snakemake_config.yaml (E.g. STAR, HISAT2 and bbmap).

First, the pipeline will build all necessary dockers. Second, in will create a /path/to/DICAST/index folder and put the results of indexing. Finally, the pipeline will create a /path/to/DICAST/output folder with the alignment results inside the dedicated folders (e.g., star-output, hisat-output, bbmap-output).

Running multiple alternative splicing event detection tools (E.g., MAJIQ and Whippet)

  1. Make sure you followed the steps described in the setup section carefully.

  2. Before getting started make sure to activate the snakemake conda environment:

conda activate dicast-snakemake
  1. Create the input folder:

cd /path/to/DICAST/
mkdir input
  1. Create the directory structure as in the sample_output:

cd input
mkdir controldir
cd controldir
mkdir fastqdir
mkdir bamdir
  1. Download or copy the genome fasta file into the input folder. Dont’t forget to uncompress it. E.g.:

cd /path/to/DICAST/input
wget http://ftp.ensembl.org/pub/release-105/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
  1. Download or copy the genome annotation file into the input folder. Dont’t forget to uncompress it. E.g.:

wget http://ftp.ensembl.org/pub/release-105/gtf/homo_sapiens/Homo_sapiens.GRCh38.105.gtf.gz
gunzip Homo_sapiens.GRCh38.105.gtf.gz
  1. Download or copy the genome gff3 annotation into the input folder (for MAJIQ). Dont’t forget to uncompress it. E.g.:

wget http://ftp.ensembl.org/pub/release-105/gff3/homo_sapiens/Homo_sapiens.GRCh38.105.gff3.gz
gunzip Homo_sapiens.GRCh38.105.gff3.gz
  1. Download or copy the fastq files you want to use into the /path/to/DICAST/input/controldir/fastqdir. Note: we support only paired-end RNA-Seq - fastq files have to be in pairs.

  2. Download or copy the bam files you want to use into the /path/to/DICAST/input/controldir/bamdir.

  3. Go to /path/to/DICAST/scripts and edit config.sh according to your run (see How to change your config.sh file):

cd /path/to/DICAST/scripts
nano config.sh

In the config.sh file edit the following lines:

read_length=76
fastaname=Homo_sapiens.GRCh38.dna.primary_assembly.fa
gtfname=Homo_sapiens.GRCh38.105.gtf
gffname=Homo_sapiens.GRCh38.105.gff3
  1. List the mapping tools you want to run:

cd /path/to/DICAST/scripts/snakemake/
nano snakemake_config.yaml

In the snakemake_config.yaml file edit the following lines:

Alternative_splicing_detection_tools:
   What_tools_to_run: 'majiq, whippet'
  1. In the /path/to/DICAST/scripts/snakemake/ folder run:

snakemake -j 1 -d /path/to/DICAST/input -s Snakefile -c snakemake_config.yaml

This command will start the mapping tools indicated in the snakemake_config.yaml (E.g. MAJIQ, Whippet).

First, the pipeline will build all necessary dockers. Second, the pipeline will create a /path/to/DICAST/output folder with the event detecton results inside the dedicated folders (e.g., majiq-output, hisat-output, whippet-output).

FAQ

Here you will find more frequently asked questions soon.

Q: How do I contribute to DICAST?
A: The best way to reach us for code updates is via our github
Q: How do I resolve issue: docker: Error response from daemon: error while creating mount source path..
A: Check if the folder that docker is trying to mount has the following permissions:drwxrwsr-x. Grant them when needed with `chmod a+rX,u+w,g+w `.

Uninstalling DICAST

In order to uninstall dicast, please execute the script with:

bash scripts/uninstall-dicast.sh

We hope DICAST served you well, please remember to cite DICAST, should you have found it useful.

About

Development

DICAST was jointly developed by the groups Big Data in Biomedicine, Computational Systems Medicine, and Computational Genomics and Transcriptomics Group

Maintainer: Amit Fenn With contributions from Tim Faro, Fanny Roessler, Johannes Kersting, Alexander Dietrich, Chit Tong Lio

Citation

If you use DICAST please cite the preprint

|Fenn, A.M., Tsoy, O., Faro, T., Roessler, F., Dietrich, A., Kersting, J., Louadi, Z., Lio, C.T., Voelker, U., Baumbach, J. and Kacprowski, T., 2022. Alternative splicing analysis benchmark with DICAST. bioRxiv.

Acknowledgments

DICAST was created with funding from the BMBF Sys_CARE project

Contact us

Amit Fenn <amit.fenn@tum.de>

Olga Tsoy <olga.tsoy@uni-hamburg.de>

Markus List <markus.list@tum.de>

Tim Kacprowski <t.kacprowski@tu-braunschweig.de>