Tutorial
Some Notes on Inputs
Please carefully read this section as inadequate inputs can impact the proper functioning of the software. If you ever wondered "why did they do this" when running the demo, you should be able to find your answer here.
-c
This .csv is the most crucial. It's a guide for HAMRLINC to locate your fastq files and present your samples in later visualizations. There are 2 columns. Left column contains the keys and right column contains the values.
Keys
Either SRR accession codes, or the file basename of the fastq input directory. If you decide to use SRR accession codes, the entire column must be SRR accession codes, visa versa. If you use SRR accession codes, each code must correspond to an existing entry in the Sequence Read Archive database. If you input your own fastq files, the file basenames should NOT include the file suffix like .fq or .fastq.gz in your csv file. See paired end example below.
Values
Succinct names for each fastq file in the context of the experiment. The naming must follow the form: SAMPLE_SEQTECH_REP. Note "_" is the delimiter and is crucial for the program. TREATMENT includes information on variant, genotype, and treatment. SEQTECH is the sequencing method used. REP is the biological replicate number. Below are some examples.
File Naming Conventions Examples
File name in csv | Actual sample |
---|---|
LemontHeat_mRNA_rep2 | a fastq file generated from mRNA-sequencing replicate 2 of a heat-treated Lemont variant O. sativa individual |
dxo1Salt_GMUCT_rep1 | a fastq file generated from GMUCT sequencing replicate 1 of a salt-treated A. thaliana individual with the gene AtDXO1 knocked out |
The demo csv files are great examples for what the actual file should look like.
Paired End
In the case of paired end sequencing, ignore the _1 and _2 in your key, and create only 1 row for each paired end sample. For example, if one of your soybean samples are paired end sequenced as /path/to/2024-1-2-soy-illumina3000-XYZ_1.fq and /path/to/2024-1-2-soy-illumina3000-XYZ_2.fq, in your csv, the row for this sample should look something like
Example
Key | Value |
---|---|
2024-1-2-soy-illumina3000-XYZ | soyWT_mRNA_rep1 |
-d
The input folder should contain only your fastq files, and nothing else. 4 types of suffixes are supported: .fq / .fastq / .fq.gz / .fastq.gz
TopHat2 mode
We recommend using STAR aligner for its speed and simplicity. However if you'd like to use TopHat2, you can use -a to activate it. In such case, you can also input a genome index folder with the flag -x. If none is inputted, the program will use bowtie2 to create one.
-k, -p, -u
The analysis steps are condensed into three modules, and each of these options can activate one of them. If none of these three flags are included, HAMRLINC will not perform any analysis.
Required Dependencies
- Linux-based computer, server, or cluster
- Docker
- Minimum memory of 32 GB and minimum disk space of 120 GB
Demo Data source
For this tutorial, we'll be using the RNA-Seq data generated by Yu et al 2021. In this work, they report that Messenger RNA 5′ NAD+ capping is a dynamic regulatory epitranscriptome mark that is required for proper response to abscisic acid in Arabidopsis. A graphic abstract is shown below:
Pulling HAMRLINC Docker Image
To run HAMRLINC, you need to first pull the docker image for the pipeline to your computer. If you are not familiar with container technology and would like to learn the basics, please check out CyVerse Container & Cloud Native Camp Documentation. It is open source and free. Dig in!
Pull HAMRLINC docker image. This should take a few minutes depending on your internet speed.
After building the conatiner, run the code below to be sure that you now have the image on your computer Your ouput should be similar to the image below:
clone HAMRLINC repo
download the genome file for Arabidopsis thaliana from ENSEMBLwget https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-57/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
gunzip Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
wget https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-57/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.57.gff3.gz
gunzip Arabidopsis_thaliana.TAIR10.57.gff3.gz
docker run --rm -v $(pwd):/working-dir -w /working-dir chosenobih/hamrlinc:v0.3 -o hamrlinc_test -c /demo/PRJNA478205.csv -g Arabidopsis_thaliana.TAIR10.dna.toplevel.fa -i Arabidopsis_thaliana.TAIR10.57.gff3 -l 50 -s 135000000 -n 8 -k
Some Notes on Software Behaviors and Terminal Output
Please understand that this analysis is time-consuming in its nature, where the time limiting factors are the individual packages used themselves, like STAR, samtools, or GATK. While we have enabled parallel processing and multicore where possible, in general, the entire run should still take well over an hour, and over 10 hours for extreme cases. To save headaches, we have included several features:
- Every run will automatically create a new log file in your working directory
- If you ran into an error and the program exited, the next run will resume from the last saved checkpoint.
Moreover, HAMRLINC can be unambiguously divided into 3 phases.
Phase 1: FASTQ Preparation
HAMRLINC will acquire or locate the fastq file. In the case of SRR accession code, we use SRA-toolkit's fasterq-dump function. Then, HAMRLINC trims each fastq with automatic adaptor detection with trim-galore, a wrapper around cutadapt. Finally, HAMRLINC performs fastqc for quality check. Files generated in this process are found in /datasets/trimmed and /datasets/fastqc. The trimmed fastq files will be used for alignment next.
Phase 2: Alignment, Pre-processing, HAMR and LINC
HAMRLINC will use either STAR or TopHat2 to align each fastq to the provided reference genome. It will then pipe the output through a series of samtools, stringtie2, cufflink, and gatk commands to pre-process the bam file as required for HAMR or EVOLINC, before running either or both of them. During this phase, you will see [SAMPLEKEY] associated with every status message for each sample outputted by HAMRLINC. You will also see numerous lines outputted by each of the packages used. This phase tends to take the longest.
Phase 3: Post-processing, Visualization
HAMRLINC will collect sequencing depth information with samtools coverage, call consensus across all replicates for each sample group (we take the union of pairwise intersection), overlap that consensus with the provided genome annotation file to associate different modifications with regions of an mRNA both structually (3', CDS, 5') and biologically (exon, non-coding). This information will then be summarized into mod_long.csv for the entire experiment (across different sample groups and sequencing methods). HAMRLINC will then generate some plots to help user visualize the spread, abundance, and types of each modification. During this, you might see many warning messages. Disregard them, unless an error is reported.
Output Interpretation
All outputs of HAMRLINC are organized in corresponding subdirectories of the output directory. When run with all three core processing enabled, HAMRLINC produces ten subdirectories in the output directory. Three subdirectories contain key intermediates like genome index files, trimmed fastq files and bed files, which can be used in various downstream processing of the user’s choice. Three other subdirectories contain the raw output for each of the three core functionalities; one last subdirectory contains the visualizations and post-HAMR analysis results.