Tutorial

Some Notes on Inputs

Please carefully read this section as inadequate inputs can impact the proper functioning of the software. If you ever wondered "why did they do this" when running the demo, you should be able to find your answer here.

-c

This .csv is the most crucial. It's a guide for HAMRLINC to locate your fastq files and present your samples in later visualizations. There are 2 columns. Left column contains the keys and right column contains the values.

Keys

Either SRR accession codes, or the file basename of the fastq input directory. If you decide to use SRR accession codes, the entire column must be SRR accession codes, visa versa. If you use SRR accession codes, each code must correspond to an existing entry in the Sequence Read Archive database. If you input your own fastq files, the file basenames should NOT include the file suffix like .fq or .fastq.gz in your csv file. See paired end example below.

Values

Succinct names for each fastq file in the context of the experiment. The naming must follow the form: SAMPLE_SEQTECH_REP. Note "_" is the delimiter and is crucial for the program. TREATMENT includes information on variant, genotype, and treatment. SEQTECH is the sequencing method used. REP is the biological replicate number. Below are some examples.

File Naming Conventions Examples

File name in csv	Actual sample
LemontHeat_mRNA_rep2	a fastq file generated from mRNA-sequencing replicate 2 of a heat-treated Lemont variant O. sativa individual
dxo1Salt_GMUCT_rep1	a fastq file generated from GMUCT sequencing replicate 1 of a salt-treated A. thaliana individual with the gene AtDXO1 knocked out

The demo csv files are great examples for what the actual file should look like.

Paired End

In the case of paired end sequencing, ignore the _1 and _2 in your key, and create only 1 row for each paired end sample. For example, if one of your soybean samples are paired end sequenced as /path/to/2024-1-2-soy-illumina3000-XYZ_1.fq and /path/to/2024-1-2-soy-illumina3000-XYZ_2.fq, in your csv, the row for this sample should look something like

Example

Key	Value
2024-1-2-soy-illumina3000-XYZ	soyWT_mRNA_rep1

-d

The input folder should contain only your fastq files, and nothing else. 4 types of suffixes are supported: .fq / .fastq / .fq.gz / .fastq.gz

TopHat2 mode

We recommend using STAR aligner for its speed and simplicity. However if you'd like to use TopHat2, you can use -a to activate it. In such case, you can also input a genome index folder with the flag -x. If none is inputted, the program will use bowtie2 to create one.

-k, -p, -u

The analysis steps are condensed into three modules, and each of these options can activate one of them. If none of these three flags are included, HAMRLINC will not perform any analysis.

Required Dependencies

Linux-based computer, server, or cluster
Docker
Minimum memory of 32 GB and minimum disk space of 120 GB

Demo Data source

For this tutorial, we'll be using the RNA-Seq data generated by Yu et al 2021. In this work, they report that Messenger RNA 5′ NAD+ capping is a dynamic regulatory epitranscriptome mark that is required for proper response to abscisic acid in Arabidopsis. A graphic abstract is shown below:

Pulling HAMRLINC Docker Image

To run HAMRLINC, you need to first pull the docker image for the pipeline to your computer. If you are not familiar with container technology and would like to learn the basics, please check out CyVerse Container & Cloud Native Camp Documentation. It is open source and free. Dig in!

Pull HAMRLINC docker image. This should take a few minutes depending on your internet speed.

docker pull chosenobih/hamrlinc:v0.3

After building the conatiner, run the code below to be sure that you now have the image on your computer

docker image ls

Your ouput should be similar to the image below:

clone HAMRLINC repo

git clone https://github.com/chosenobih/HAMRLINC.git
cd HAMRLINC

download the genome file for Arabidopsis thaliana from ENSEMBL

wget https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-57/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz
gunzip Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz

download the annotation file for Arabidopsis thaliana from ENSEMBL

wget https://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-57/gff3/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.57.gff3.gz
gunzip Arabidopsis_thaliana.TAIR10.57.gff3.gz

run HAMRLINC in SE mode with SRA IDs. Only RNA modification annotation analysis arm is activated

docker run --rm -v $(pwd):/working-dir -w /working-dir chosenobih/hamrlinc:v0.3 -o hamrlinc_test -c /demo/PRJNA478205.csv -g Arabidopsis_thaliana.TAIR10.dna.toplevel.fa -i Arabidopsis_thaliana.TAIR10.57.gff3 -l 50 -s 135000000 -n 8 -k

Some Notes on Software Behaviors and Terminal Output

Please understand that this analysis is time-consuming in its nature, where the time limiting factors are the individual packages used themselves, like STAR, samtools, or GATK. While we have enabled parallel processing and multicore where possible, in general, the entire run should still take well over an hour, and over 10 hours for extreme cases. To save headaches, we have included several features:

Every run will automatically create a new log file in your working directory
If you ran into an error and the program exited, the next run will resume from the last saved checkpoint.

Moreover, HAMRLINC can be unambiguously divided into 3 phases.

Phase 1: FASTQ Preparation

HAMRLINC will acquire or locate the fastq file. In the case of SRR accession code, we use SRA-toolkit's fasterq-dump function. Then, HAMRLINC trims each fastq with automatic adaptor detection with trim-galore, a wrapper around cutadapt. Finally, HAMRLINC performs fastqc for quality check. Files generated in this process are found in /datasets/trimmed and /datasets/fastqc. The trimmed fastq files will be used for alignment next.

Phase 2: Alignment, Pre-processing, HAMR and LINC

HAMRLINC will use either STAR or TopHat2 to align each fastq to the provided reference genome. It will then pipe the output through a series of samtools, stringtie2, cufflink, and gatk commands to pre-process the bam file as required for HAMR or EVOLINC, before running either or both of them. During this phase, you will see [SAMPLEKEY] associated with every status message for each sample outputted by HAMRLINC. You will also see numerous lines outputted by each of the packages used. This phase tends to take the longest.

Phase 3: Post-processing, Visualization

HAMRLINC will collect sequencing depth information with samtools coverage, call consensus across all replicates for each sample group (we take the union of pairwise intersection), overlap that consensus with the provided genome annotation file to associate different modifications with regions of an mRNA both structually (3', CDS, 5') and biologically (exon, non-coding). This information will then be summarized into mod_long.csv for the entire experiment (across different sample groups and sequencing methods). HAMRLINC will then generate some plots to help user visualize the spread, abundance, and types of each modification. During this, you might see many warning messages. Disregard them, unless an error is reported.

Output Interpretation

All outputs of HAMRLINC are organized in corresponding subdirectories of the output directory. When run with all three core processing enabled, HAMRLINC produces ten subdirectories in the output directory. Three subdirectories contain key intermediates like genome index files, trimmed fastq files and bed files, which can be used in various downstream processing of the user’s choice. Three other subdirectories contain the raw output for each of the three core functionalities; one last subdirectory contains the visualizations and post-HAMR analysis results.

Abundance of modification by group — (a-g) Bar plots of the total abundance of HAMR predicted modifications by sample groups in CDS, exon, 5`UTR, gene, ncRNA, primary mRNA, and 3` UTR regions. (h) HAMR predicted modification abundance located in different RNA subtypes

Abundance of modification types — (a-g) Bar plots of the abundance of HAMR predicted modification classes by sample groups in CDS, exon, 5`UTR, gene, ncRNA, primary mRNA, and 3` UTR regions. (h) Number of HAMR predicted modifications per gene region

(a) Distribution of modification types in gene regions by sample groups. (b) Distribution of modification types in gene regions

Modification-based GO term heatmap — GO term heatmap and predicted enrichment landscape