Isoforms ======== Necessary inputs ---------------- You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. It has to be a comma-separated file with 5 columns, and a header row as shown in the examples below. .. code:: console sample,fastq_1,fastq_2,condition,batch CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,control,1,0 CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,control,2,0 CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,control,3,0 Important --------- .. rubric:: Sample names have to be in the first column or in a column called sampleID. :name: sample-names-have-to-be-in-the-first-column-or-in-a-column-called-sampleid. .. rubric:: If you have column names other than **condition** and **batch** you need to declare the names in the params_isoforms.yml. See below (Preprocess,DEA,PEA) :name: if-you-have-column-names-other-than-condition-and-batch-you-need-to-declare-the-names-in-the-params_isoforms.yml.-see-below-preprocessdeapea .. code:: console sampleID,condition CONTROL_REP1,ctrl CONTROL_REP2,ctrl TREATMENT_REP1,treat +---+-------------------------------------------------------------------+ | C | Description | | o | | | l | | | u | | | m | | | n | | +===+===================================================================+ | | Custom sample name. This entry will be identical for multiple | | | sequencing libraries/runs from the same sample. Spaces in sample | | s | names are automatically converted to underscores (``_``). | | a | | | m | | | p | | | l | | | e | | | | | | | | +---+-------------------------------------------------------------------+ | | Full path to FastQ file for Illumina short reads 1. File has to | | | be gzipped and have the extension “.fastq.gz” or “.fq.gz”. | | f | | | a | | | s | | | t | | | q | | | _ | | | 1 | | | | | | | | +---+-------------------------------------------------------------------+ | | Full path to FastQ file for Illumina short reads 2. File has to | | | be gzipped and have the extension “.fastq.gz” or “.fq.gz”. | | f | | | a | | | s | | | t | | | q | | | _ | | | 2 | | | | | | | | +---+-------------------------------------------------------------------+ | | Custom sample name. This entry will be identical for multiple | | | sequencing libraries/runs from the same sample. Spaces in sample | | s | names are automatically converted to underscores (``_``). | | t | | | r | | | a | | | n | | | d | | | n | | | e | | | s | | | s | | | | | | | | +---+-------------------------------------------------------------------+ | | Metadata describing your test condition (or treatment, or state | | | etc) | | c | | | o | | | n | | | d | | | i | | | t | | | i | | | o | | | n | | | | | | | | +---+-------------------------------------------------------------------+ | | Describes unwanted source of variation (e.g., technical | | | replicates, different platfroms, different batches etc.). | | b | | | a | | | t | | | c | | | h | | | | | | | | +---+-------------------------------------------------------------------+ Start from SRA -------------- Alternatively, instead of providing with paths of fastq files you can provide a column with SRA **run** identifiers you wish to download from NCBI in the first column named sample and the second column has to be named sampleID and store the SRA **codes**: .. code:: console sample,sampleID,condition,batch,cl SRR2015757,SRX10229011,1 SRR2015760,SRX10229042,1 SRR2015761,SRX10229053,1 An `example samplesheet `__ has been provided with the pipeline. Reference files ---------------- The user has to provide the location of local reference files: .. code:: bash params{ fasta_isoforms : '/home/bianca/gencode2/GRCh38.primary_assembly.genome.fa.gz' transcript_fasta_isoforms : '/home/bianca/gencode2/gencode.v44.transcripts.fa.gz' gtf_isoforms : '/home/bianca/gencode2/gencode.v44.annotation.gtf.gz' Running the pipeline -------------------- In order to run the isoform part of the pipeline you have to modify one file, specifying which part of the analysis you want to run and parameters associated with it: `params_isoforms.yml `__ .. code:: bash params{ outdir '[full path of location you want to output]' salmonDirIso '[full path of directory where outdir is/isoforms/salmon_isoforms/]' input_isoforms '[full path of samplesheet with SRA code or location of fastq files]' } In addition you have to provide suitable reference fasta files regarding genome, transcripts and a gtf file regarding the genomic coordinates of the organism in study. .. code:: bash params{ fasta_isoforms : '[full path of location of fasta of the genome]' transcript_fasta_isoforms: '[full path of location of fasta of the transcripts ]' gtf_isoforms : '[full path of location of the gtf of the organism]' } The general command to run the pipeline is: .. code:: bash nextflow run nf-core/mom -c parmas_file nf-core/mom/params_isoforms.yml -profile docker This will launch the pipeline with the ``docker`` configuration profile. See below for more information about profiles. Note that the pipeline will create the following files in your working directory: .. code:: bash work # Directory containing the nextflow working files # Finished results in specified location (defined with --outdir) .nextflow_log # Log file from Nextflow # Other nextflow hidden files, eg. history of pipeline runs and old logs. The pipeline initially downloads SRA codes and converts the runs into fastq files. Alternatively, you can provide local fastq files. It then performs quality control with [FASTQC] and then automatically detects and removes adapterS with [Trimgalore]. Each of the above steps can be skipped, for example if you don’t want to perform quality control, you can specify in the isoforms.config file: .. code:: bash params{ skip_qc_isoforms= true } It then employs salmon in order to obtain quantification files with format :: OUTDIR/salmon_isoforms/sampleID/quant.sf. If you want to skip the alignement step you need to specify the location of those files in the respective field in the params_isoforms.yml file: .. code:: bash params{ salmonDirIso # path where your outputs from aligningg are located } Note: All files need to be in the format: ----------------------------------------- .. code:: plaintext salmonDirIso: - sampleID/ - quant.sf After that isoformSwitchAnalyzer is used, which takes these quantification files and performs differential expression analysis on both the level of isoforms and genes. IsoformSwitchAnalyzer requires a samplesheet_isoforms.csv (phenotype file) with necessary columns sampleID and condition. The design matrix is of the form : .. code:: console ~0 + condition Then diferentially expressed features are collected and their sequences are annotated regarding their coding potential [CPAT], their homology with protein domains [Pfam] and the existence of any signaling sequence [signalP]. This is performed with subworkflow functional_annotation.nf The next step of the analysis is to assess functional implications of the differentially isoform/exon usage on the expression of the different genes and isoforms. We provide many insightful plots for this purpose under the direcorty OUTDIR/isovisual. Moreover, we additionally provide one output specifically focused on lncRNAs and a correlation matrix between differentially expressed lncRNAs and genes. Lastly, we provide the R object if the user wishes to inspect the results more thoroughly.