Isoforms

Necessary inputs

You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. It has to be a comma-separated file with 5 columns, and a header row as shown in the examples below.

sample,fastq_1,fastq_2,condition,batch
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,control,1
CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,control,2
CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,control,3

Important

Sample names have to be in the first column or in a column called sampleID.

If your column names differ from condition and batch, you need to declare them in params_isoforms.yml. See below (Preprocess, DEA, PEA).

sampleID,condition
CONTROL_REP1,ctrl
CONTROL_REP2,ctrl
TREATMENT_REP1,treat
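The samplesheet rules above can be checked before launching a run. The following is a minimal sketch, not part of the pipeline: the function name check_samplesheet is hypothetical, and it assumes that sample names are read from the sampleID column when it exists and from the first column otherwise.

```python
import csv
import io


def check_samplesheet(text):
    """Return (name of the sample column, list of problems found)."""
    rows = list(csv.reader(io.StringIO(text.strip())))
    if not rows:
        return None, ["samplesheet is empty"]
    header, records = rows[0], rows[1:]
    # Assumption: use "sampleID" if present, otherwise the first column.
    sample_col = header.index("sampleID") if "sampleID" in header else 0
    problems = []
    for number, row in enumerate(records, start=2):
        if len(row) != len(header):
            # A row with more or fewer fields than the header is malformed.
            problems.append(
                f"line {number}: {len(row)} fields, expected {len(header)}"
            )
        elif not row[sample_col].strip():
            problems.append(f"line {number}: empty sample name")
    return header[sample_col], problems


sheet = """sampleID,condition
CONTROL_REP1,ctrl
CONTROL_REP2,ctrl
TREATMENT_REP1,treat
"""
print(check_samplesheet(sheet))
```

An empty problem list means every row has as many fields as the header and a non-empty sample name.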

Start from SRA

Alternatively, instead of providing paths to fastq files, you can provide the SRA run identifiers you wish to download from NCBI. The first column has to be named sample and contain the SRA run identifiers; the second column has to be named sampleID and store the SRA codes:

sample,sampleID,condition,batch,cl
SRR2015757,SRX10229011,1
SRR2015760,SRX10229042,1
SRR2015761,SRX10229053,1
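A quick sanity check on the run identifiers can catch typos before any download starts. This is a sketch under the assumption that run accessions start with SRR, ERR or DRR followed by digits; the helper name invalid_run_ids is hypothetical and not part of the pipeline.

```python
import re

# Assumption: run accessions look like SRR/ERR/DRR followed by digits.
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def invalid_run_ids(run_ids):
    """Return the identifiers that do not look like run accessions."""
    return [run_id for run_id in run_ids if not RUN_ACCESSION.match(run_id)]


# Values taken from the first column of the example above.
print(invalid_run_ids(["SRR2015757", "SRR2015760", "SRR2015761"]))
```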

An example samplesheet has been provided with the pipeline.

Reference files

The user has to provide the location of local reference files:

params{
  fasta_isoforms            : '/home/bianca/gencode2/GRCh38.primary_assembly.genome.fa.gz'
  transcript_fasta_isoforms : '/home/bianca/gencode2/gencode.v44.transcripts.fa.gz'
  gtf_isoforms              : '/home/bianca/gencode2/gencode.v44.annotation.gtf.gz'
}

Running the pipeline

To run the isoform part of the pipeline you have to modify one file, params_isoforms.yml, specifying which part of the analysis you want to run and the parameters associated with it:

params{
  outdir         : '[full path of location you want to output]'
  salmonDirIso   : '[full path of directory where outdir is/isoforms/salmon_isoforms/]'
  input_isoforms : '[full path of samplesheet with SRA code or location of fastq files]'
}

In addition, you have to provide suitable reference files for the organism under study: FASTA files for the genome and the transcripts, and a GTF file with the genomic coordinates.

params{
  fasta_isoforms            : '[full path of location of fasta of the genome]'
  transcript_fasta_isoforms : '[full path of location of fasta of the transcripts]'
  gtf_isoforms              : '[full path of location of the gtf of the organism]'
}
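Because all three reference files are required, it can help to confirm that the paths exist before starting a run. The sketch below is illustrative only: missing_references is a hypothetical helper, the parameters are passed as a plain dictionary rather than parsed from the params file, and the example paths are placeholders.

```python
from pathlib import Path

# Parameter names as used in this section.
REFERENCE_KEYS = ("fasta_isoforms", "transcript_fasta_isoforms", "gtf_isoforms")


def missing_references(params):
    """Return the reference keys that are unset or do not point to a file."""
    missing = []
    for key in REFERENCE_KEYS:
        value = params.get(key)
        if not value or not Path(value).is_file():
            missing.append(key)
    return missing


# Placeholder paths; replace them with your own locations.
params = {
    "fasta_isoforms": "/path/to/genome.fa.gz",
    "transcript_fasta_isoforms": "/path/to/transcripts.fa.gz",
    "gtf_isoforms": "/path/to/annotation.gtf.gz",
}
print(missing_references(params))
```

Any key printed here should be corrected in the params file before the pipeline is launched.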

The general command to run the pipeline is:

nextflow run nf-core/mom -params-file nf-core/mom/params_isoforms.yml -profile docker

This will launch the pipeline with the docker configuration profile. See below for more information about profiles.

Note that the pipeline will create the following files in your working directory:

work                # Directory containing the nextflow working files
<OUTDIR>            # Finished results in specified location (defined with --outdir)
.nextflow.log       # Log file from Nextflow
# Other nextflow hidden files, eg. history of pipeline runs and old logs.

The pipeline first downloads the runs for the given SRA codes and converts them into fastq files. Alternatively, you can provide local fastq files. It then performs quality control with [FastQC] and automatically detects and removes adapters with [Trim Galore].

Each of the above steps can be skipped, for example if you don’t want to perform quality control, you can specify in the isoforms.config file:

params{
  skip_qc_isoforms = true
}

It then runs Salmon to obtain one quantification file per sample, stored as:

OUTDIR/salmon_isoforms/sampleID/quant.sf

If you want to skip the alignment step, you need to specify the location of those files in the respective field in the params_isoforms.yml file:

params{
  salmonDirIso # path where your outputs from aligning are located
}

Note: All files need to be in the format:

salmonDirIso:
- sampleID/
  - quant.sf

After that, IsoformSwitchAnalyzeR takes these quantification files and performs differential expression analysis at both the isoform and the gene level. IsoformSwitchAnalyzeR requires a samplesheet_isoforms.csv (phenotype file) with the mandatory columns sampleID and condition. The design matrix has the form:

~0 + condition
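To illustrate what this formula means: with no intercept, the design matrix has one indicator column per condition level, and each sample has a 1 in the column of its own condition. The sketch below builds such a matrix in plain Python. It is not the pipeline's code; the column naming (condition followed by the level) mirrors R's convention, and the alphabetical ordering of levels is an assumption.

```python
def design_matrix(conditions):
    """Build a no-intercept design: one 0/1 column per condition level."""
    levels = sorted(set(conditions))  # assumption: levels in alphabetical order
    columns = [f"condition{level}" for level in levels]
    rows = [[1 if c == level else 0 for level in levels] for c in conditions]
    return columns, rows


columns, rows = design_matrix(["ctrl", "ctrl", "treat"])
print(columns)  # ['conditionctrl', 'conditiontreat']
for row in rows:
    print(row)
# [1, 0]
# [1, 0]
# [0, 1]
```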

Then the differentially expressed features are collected and their sequences are annotated for coding potential [CPAT], homology with protein domains [Pfam] and the presence of signal peptides [SignalP]. This is performed by the subworkflow functional_annotation.nf.

The next step of the analysis assesses the functional implications of differential isoform/exon usage on the expression of the different genes and isoforms. We provide a range of plots for this purpose under the directory OUTDIR/isovisual. We additionally provide one output specifically focused on lncRNAs and a correlation matrix between differentially expressed lncRNAs and genes. Lastly, we provide the R object for users who wish to inspect the results more thoroughly.