Using smallrnaseq

This page refers to using the Command Line Interface of smallrnaseq. For programmers using the API see code examples.

Installing the package provides the command smallrnaseq in your path. This allows users is a command line interface to the library without the need for any Python coding at all. It provides a set of pre-defined functions with parameters specified in a text configuration file. The primary input is one or more fastq files containing short read data from a small rna-seq experiment. Note: It is assumed they have already been adapter trimmed, if required.

Usage

Usage largely involves setting up the config file and having your input files prepared. Running the command smallrnaseq -c default.conf will create a new config file for you to work from if none exists. Just edit this with a text editor and then to execute:

smallrnaseq -c default.conf -r

Several other functions are available from the command line without the config file, i.e. to collapse or trim reads. Type smallrnaseq -h to get a list of options.

Configuration file settings

The advantage of configuration files is in avoiding long commands that have to be remembered or are prone to mistakes. Also the config files can be kept to recall what setting we used or to copy them for another set of files. The current options available in the file are shown below. The meaning of each option is explained explained below. If you are unsure or don’t require an option value, leave it at the default. Options can be excluded from the file completely and defaults will be used but it’s recommended to just leave unused options blank. You can also comment out lines with a ‘#’ at the start if you want it to be ignored. The [base] heading should always be present an indicates a section of the file. The [aligner] section is for setting alignment parameters on a per library basis if you need to. The [novel] section is for a few novel mirna prediction settings. The [de] section is used for differential expression which can be ignored if you don’t need that:

[base]
filenames =
path =
overwrite = 0
index_path = indexes
libraries =
ref_fasta =
features =
output = results
add_labels = 1
aligner = bowtie
mirna = 0
species = hsa
pad5 = 3
pad3 = 5
verbose = 1
cpus = 1

[aligner]
default_params = -v 1 --best
mirna_params = -v 1 -a --best --strata --norc

[novel]
score_cutoff = 0.8
read_cutoff = 50

[de]
sample_labels =
sep = ,
sample_col =
factors_col =
conditions =
logfc_cutoff =

Settings explained:

name example value meaning
filenames test.fastq input fastq file(s) with reads, comma separated
path testfiles folder containing fastq files instead of listing files
index_path indexes location of bowtie or subread indexes
aligner bowtie which aligner to use, bowtie or subread
ref_fasta hg19 reference genome fasta file, optional
libraries RFAM_human,mirbase-hsa names of annotated library indexes to map to
features Homo_sapiens.GRCh37.75.gtf genome annotation file. ONLY needed for counting genomic features
output smrna_results output folder for temp files
add_labels 1 whether to add labels to replace the file names in the results (0 or 1)
mirna 0 run mirna counting workflow (0 or 1)
species bta mirbase species to use
pad5 3 3’ flanking bases to add when generating mature mirbase sequences
pad3 5 5’ flanking bases to add
verbose 1 print extra information
cpus 1 number of threads to use
sample_labels samplefile.txt csv file with sample labels
default_param s -v 1 –best default alignment parameters
mirna_params -v 1 -a –best –strata –norc default miRNA alignment parameters

Examples

Mapping to one or more libraries of known RNAs

This will simply consist of setting the libraries option to match the names of files you have created aligner indexes for. You can download RNA annotations in fasta format from http://rnacentral.org/. The index files should be placed in the index_path folder. To create an index using bowtie you supply a fasta file and the indexes are placed in a folder called ‘indexes’. You can move them to another folder if needed:

smallrnaseq -b myrnas.fa

By default file names are replaced with short ids of the form s01, s02, etc. This also writes out a file called sample_labels.csv in the output folder. Set add_labels = 0 if you don’t want this behaviour.

Mapping to known miRNAs and finding novel miRs

Say we have a set of fastq files in the folder ‘testfiles’ that we want to count miRNAs in. We would set the options mirna = 1 and path = testfiles. Note if just mapping to mirbase mature/precursor sequence you don’t have to create an index file since it is generated automatically. If you are using any other libraries you should create the aligner index first. For novel miRNA discovery we need the reference genome and index for that.

Mapping to a reference genome with small RNA features

There are some advantages to using a full reference genome in that it allows us to see of the reads align outside the target transcripts and we might therefore exclude them. Also we can normalise the read counts by the sum of all mapped reads. This depends on what features you have in the gtf file. To count our reads using features in a reference genome, provide the ref_genome name which should correspond to the index name of the reference sequence. We also need to set features, a set of features stored in a gtf, gff or bed file. Remember that the coordinates in this file must correspond to the reference sequence. See Counting features for more information.

Outputs

The main outputs are csv files with the counts for each sample in a column, also produced is a ‘long form’ csv file with a separate row for every sample. These csv files can be opened in a spreadsheet. Temporary files such as collapsed read files are placed in a ‘temp’ folder inside the output folder.

When you run the mirna workflow on a set of samples, a file called mirbase_mature_counts.csv is created in the folder. The column names are mostly self explanatory. Each sample has a column for the raw counts and normalized counts. mean_norm is the normalized mean and total_reads is the sum of all raw read counts:

name,db,sample1,sample2,sample3,sample1 norm,sample2 norm,sample3 norm,total_reads,mean_norm
bta-miR-486,mirbase-bta,1444,1070,5579,176722.56,47917.6,127569.57,239793,284398.062
bta-miR-122,mirbase-bta,693,10676,4,84812.14,478101.21,91.46,11409,149778.2825
bta-miR-423-5p,mirbase-bta,337,100,2008,41243.42,4478.28,45914.98,4270,62353.636
bta-miR-22-3p,mirbase-bta,315,1053,5655,38550.97,47156.29,129307.39,7462,45154.618
bta-miR-92a,mirbase-bta,332,891,2484,40631.5,39901.48,56799.21,5989,39094.122
bta-miR-192,mirbase-bta,659,656,1482,80651.08,29377.52,33887.45,2945,33098.118
bta-miR-21-5p,mirbase-bta,453,227,1311,55439.97,10165.7,29977.36,2054,27358.996
..

Differential expression

This workflow is done using a separate command and via some extra options not shown above for clarity. To execute this type of analysis you must have done gene counting (e.g. miRNAs) and have the results in a csv file. The analysis is then run using:

smallrnaseq -c default.conf -d

In the default file the additional options needed are in a separate [de] section. Most are blank by default. If you want to do differential expression analysis of the genes from your results the other main thing you need to provide is a sample label file that matches the filenames you have analysed to the conditions. You then choose which conditions you want to compare. The options are explained below.

name example value meaning
count_file mirna_counts.csv csv file with gene counts
sample_labels labels.txt csv/text file with sample labels
sep , separator for sample labels text file
sample_col file_name column for the sample file names
factors_col status column for factor labels
conditions control,infected conditions to compare from factor column (each replicate will have the same condition)
logfc_cutoff 1.5 cutoff for log fold changes (lower are ignored)

This analysis is run using the R edgeR package, so R must be installed. See Installing R.

Example

The sample label file below shows how we use the above options. Our filenames are in the Run_s column. We want to compare the conditions 3 and 6 months samples in the age_s (the ‘factor’) column. mirna_mature_counts.csv is the file where counts from our mirna analysis are stored. This should have a column for each sample. Note that you could use counts output from any program as long as they are csv and have the right column names, see the output file formats section. It is assumed you are using replicates. Note that column names are case sensitive:

Run_s age_s   isolate_s
SRR3457948    3 months        animal1
SRR3457949    6 months        animal1
SRR3457950    6 months        animal4
SRR3457951    15 months       animal4
SRR3457952    3 months        animal5
SRR3457953    6 months        animal5
SRR3457954    15 months       animal5
SRR3457955    3 months        animal6
SRR3457956    6 months        animal6
...

So the config file will look like this:

[de]
sample_labels = SraRunTable.txt
sep = tab #tab delimiter
count_file = mirna_mature_counts.csv
sample_col = Run_s
factors_col = age_s
conditions = 3 months,6 months
logfc_cutoff = 1.5

Adapter trimming

Since there are numerous programs to perform this task it is left to the user to perform trimming of the reads prior to input.

Aligners

Traditional sequence alignment tools like BLAST are not well suited for next generation sequencing where one needs to align millions of short sequences very quickly. This has given rise to the development of a new class of short read aligners of which there are now dozens available. For small RNA-seq data alignment present certain specific challenges but standard aligners used for normal RNA-seq are usually adequate with the right parameters. The aligner and parameters will have an effect on the results, so trying more than one might be a good idea.

Currently smallrnaseq integrates bowtie (version 1) and subread. Though others can be added on request. These are free and easy to install on linux and OSX systems. On Ubuntu the following command installs both:

apt install bowtie subread

Alignment settings

The parameters used for the alignment/mapping procedure can be important in the final counts produced, irrespective of the aligner used.

This can be a complex topic in itself and general users will be confused by the many options. The command line tool for smallrnaseq takes the simple approach of providing a default alignment parameter for general mapping to libraries, another for mapping miRNAs and one for reference genomes. All can be changed in the config file if needed. You can also set custom parameters per library in the aligner section.

Bowtie

For general mapping -v 1 --best is used. -v 1 reports read mappings with up to one mismatch, options --best orders the mappings from best to worse alignments.

In miRDeep2 when mapping to the mature miRNAs (miRBase sequences) for mature quantification the following parameters are used:

-v 1 -a --best --strata --norc

Here -a means report all valid alignments, options --best --strata orders the mappings from best to worse alignments according to the strata definition of bowtie. --norc means do not map reads to the reverse complement of the sequences.

For reference genome mapping miRDeep2 uses these parameters:

-n 0 -e 80 -l 18 -a -m 5 --best --strata

Subread

-m 2 -M 1 is the default for general alignment to libraries. If you use subread you can check the parameters by typing subread-align --help at the command line, or refer to the website.

Configuration file

In the aligner section set your parameters. In the example below bos_taurus is the name of the reference genome. We have also used custom settings for mirna and another library of tRNAs.

[aligner]
mirna_params = -n 1 -l 20
bos_taurus = -v 1 -k 50
bosTau8-tRNAs = -v 0 --best

Code example

If using the package in your python code, aligner parameters are set via the aligners module. This is done before calling mapping routines such as map_rnas.

for example:

from smallrnaseq import aligners
aligners.BOWTIE_PARAMS = '-v 0 --best'
aligners.SUBREAD_PARAMS = '-m 0 -M 1'

References

  • Shi, J., Dong, M., Li, L., Liu, L., Luz-Madrigal, A., Tsonis, P. A.,
… Liang, C. (2015). mirPRo–a novel standalone program for differential expression and variation analysis of miRNAs. Scientific Reports, 5, 14617. http://doi.org/10.1038/srep14617
  • Friedländer, M. R., Mackowiak, S. D., Li, N., Chen, W., & Rajewsky,
N. (2012). miRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades. Nucleic Acids Research, 40(1), 37–52. http://doi.org/10.1093/nar/gkr688