Methodology =========== smallrnaseq is a Python package that implements some of the standard approaches for quantification and analysis of sncRNAs. This is usually implemented as part of a 'pipeline' that goes from raw fastq files to final counts of specific genes (e.g. coding transcripts or RNA species). For command line usage see the 'using smallrnaseq' section. Python programmers may find the various modules useful in creating their own more flexible workflows. A functional approach is used with a flat hierarchy of modules with limited use of classes. .. image:: https://raw.githubusercontent.com/dmnfarrell/smallrnaseq/master/img/workflow.png Counting known miRNAs --------------------- This program uses the known miRNA sequences from miRBase for counting of species specific sequences. The current release used is version 22 (March 2018). Counting isomiRs ---------------- Much miRNA expression profiling uses the read counts of all isoforms (isomiRs) of the canonical mature sequence summed together. While this may be appropriate for many cases it is also useful to consider the isoforms separately. This is true for several reasons: 1. you only wish to count the exact canonical sequence. 2. if the canonical mature sequence in miRBase is not even common in the tissue or samples you are studying and you wish to know this information. 3. your samples contain two isoforms that differ such that one will not be detected in (for example) a PCR assay, giving an inconsistent result. 4. one or more isoforms can better distinguish between sets of samples (i.e. control and disease) than the entire sum of counts. Classifying isomiRs ~~~~~~~~~~~~~~~~~~~ Sequence variants include 5’ and 3’ trimming and extension, non-templated additions (enzymatically addition of a nucleotide to the 3’ end, i.e. adenylation, uridylation). A single read can have more than one modification so sRNAbench provided a useful non-redundant hierarchical method of classifying isomiRs. See reference below. A similar scheme has been used by Loher et al. The sRNAbench scheme is illustrated below: isomiR counting ~~~~~~~~~~~~~~~ We have implemented the sRNAbench naming scheme. isomiR data is output by default when the ``map_mirbase`` method is called. If required, you can explicitly call the routine to count isomirs, assuming you have already aligned to the mirbase sequences and have a sam file. You should also provide the original read counts loaded from the csv file which is always created when the reads are collapsed. :: smrna.count_isomirs(samfile, truecounts, species='bta') References ~~~~~~~~~~ - Barturen, G., Rueda, A., … Hackenberg, M. (2014). sRNAbench: profiling of small RNAs and its sequence variants in single or multi-species high-throughput experiments. Methods in Next Generation Sequencing, 1(1), 21–31. - Telonis, A.G. et al., 2015. Beyond the one-locus-one-miRNA paradigm: microRNA isoforms enable deeper insights into breast cancer heterogeneity. Nucleic Acids Research, 43(19), pp.9158–9175. - Loher,P., Londin,E.R. and Rigoutsos,I. (2014) IsomiR Expression Profiles in Human Lymphoblastoid Cell Lines Exhibit Population and Gender Dependencies. Oncotarget, 5, 8790–8802. Counting genomic features ------------------------- Counting features means **counting the overlap of your reads with the locations of gene annotations** (the features) in a reference genome. A short read aligner is the first step. Note that if you are aligning to a human-sized reference genome, creating the index from the fasta file can take some time. Also this process can use a lot of memory. See the section on Aligners for more information. After read alignment to a reference genome the next step in transcriptomics is usually to count the features that each read intersects. This done using the sam or bam file generated by the aligner (e.g. bowtie) along with a gtf or gff file that stores the annotated genes - the features. These files are available for many spedcies from Ensembl. These will include annotations for `non coding RNAs`_. Counting features ~~~~~~~~~~~~~~~~~ :: import smallrnaseq as smrna import pandas as pd readcounts = pd.read_csv('counts.csv') fcounts = smrna.count_features(samfile, gtffile, truecounts=readcounts) This returns a pandas dataframe of the form: :: name reads 259 _no_feature 290705 50 _unmapped 145281 645 ENSBTAT00000060484 15162 287 ENSBTAT00000042309 6284 538 ENSBTAT00000051764 5575 ... This will merge the gtf file fields with the read counts, which can also be done explicitly using: :: res = base.merge_features(fcounts, gtffile) Links ~~~~~~~~~~ * `HTSeq feature counting`_ * `Ensembl ftp downloads`_ * `Ensembl genome annotation`_ .. _non coding RNAs: http://www.ensembl.org/info/genome/genebuild/ncrna.html .. _HTSeq feature counting: http://www-huber.embl.de/users/anders/HTSeq/doc/counting.html .. _Ensembl ftp downloads: http://www.ensembl.org/info/data/ftp/index.html .. _Ensembl genome annotation: http://www.ensembl.org/info/genome/genebuild/genome_annotation.html Novel miRNA prediction ---------------------- **Novel miRNAs** in a species are generally those for which the mature sequence is not present in miRBase. This may be because the species has not been well studied or insufficient evidence is available to consider a sequence a real miRNA. There are problems with false positives in general for miRNA discovery from deep sequencing and this should be kept in mind when running an analysis of your reads. It is advised to use at least two algorithms for such an analysis. Most of the higher abundance miRNAs have been identified in the commonly studied species. Detection/prediction of new mirs will therefore involve looking at the low abundance (or tissue specific) forms which will need further evidence such as conservation, experimental verification and perhaps identification of function. There are by now multiple algorithms available for predicting novel miRNAs from small RNA sequencing data. The most popular is probably miRDeep2. This algorithm is accessible via the mirdeep2 module assuming you have installed mirdeep2 as well. However we include a native algorithm in smallrnaseq for novel prediction, this is described below. Method ~~~~~~ The built-in method for novel prediction should be considered a somewhat ‘quick and dirty’ method at present but is relatively fast and convenient to use. The basic idea is to take clusters of reads that could be mature sequence and find suitable precursors. Structural features of each precursor are then scored using a classifier. The best candidate is selected is there is at least one. We have followed a similar approach to the miRanalyzer method. The following features are currently used in our algorithm, most are the same as those used in sRNAbench (miRanalyzer). The diagram below may help to clarify some of the terminology used. .. image:: https://raw.githubusercontent.com/dmnfarrell/smallrnaseq/master/img/mirna_example.png +-------------------------+--------------------------------------------+ | Feature | Description | +=========================+============================================+ | Length | The length of the longest hairpin | | | structure | +-------------------------+--------------------------------------------+ | Stem length | The length of the longest hairpin | | | structure stem | +-------------------------+--------------------------------------------+ | Mfe | The mean free energy of the hairpin | +-------------------------+--------------------------------------------+ | Loop length | The number of bases in the loop of the | | | hairpin | +-------------------------+--------------------------------------------+ | Loop GC | The GC-content of the loop | +-------------------------+--------------------------------------------+ | GC | The GC-content of the small hairpin | +-------------------------+--------------------------------------------+ | Asymmetric bulges | The number of asymmetric bulges and | | | mismatches in the stem | +-------------------------+--------------------------------------------+ | Symmetric bulges | The number of symmetric bulges and | | | mismatches in the stem | +-------------------------+--------------------------------------------+ | Bulges | The number of bulges in the stem | +-------------------------+--------------------------------------------+ | Longest bulge | The number of non-pairing nucleotides of | | | the longest bulge | +-------------------------+--------------------------------------------+ | Hairpin mismatches | The number of single mismatches in the | | | hairpin | +-------------------------+--------------------------------------------+ | Mature mismatches | The number of single mismatches in the | | | mature microRNA region of the hairpin | +-------------------------+--------------------------------------------+ | Triplet-SVM features | All features that were proposed by Xue et | | | al. | +-------------------------+--------------------------------------------+ Algorithm: ^^^^^^^^^^ The basic steps for novel precursor detection are as follows. 1. if multiple samples, read counts for all samples are put together using each sam file and total counts for each unique read (after collapsing the original fastq files) 2. the reads are sorted by read count 3. reads are clustered using cluster trees 4. read clusters are themselves clustered to detect pairs within 100 nt (for animals). these are considered to be possible mature/star arms and checked for hairpin structure 5. single clusters are checked for precursors by creating multiple possible precursors at the 5’ and 3’ ends and evaluating the features 6. a precursor is discarded if: - it has no hairpin - its read cluster overlaps with the hairpin loop - it has less than 19 bindings in the stem - it has less than 11 bindings to the region occupied by the read cluster 7. a random forest classifier/regressor trained on known positives and negatives is used to score the precursor features 8. the precursor with the lowest energy and highest score is used as the most likely candidate 9. for single clusters this region is considered the mature arm and the star sequence is estimated 10. the list of novel miRNAs with precursor, mature, star, reads is output Command line usage ~~~~~~~~~~~~~~~~~~ Using the command line tool it is simply a matter of setting ``novel = 1`` in the config file. Reads will first be mapped to mirbase to remove known miRNAs and also any libraries you included in the indexes option. The novel prediction step from the command line tool will produce a file called ``novel_mirna.csv`` in the output folder. Code examples ~~~~~~~~~~~~~ :: from smallrnaseq import novel import pandas as pd #single file prediction readcounts = pd.read_csv('countsfile.csv') samfile = 'mysamfile.sam' reads = utils.get_aligned_reads(samfile, readcounts) new = novel.find_mirnas(reads, ref_fasta) References ~~~~~~~~~~ - Kang, W. & Friedländer, M.R., 2015. Computational Prediction of miRNA Genes from Small RNA Sequencing Data. Frontiers in Bioengineering and Biotechnology, 3, p.7. - Hackenberg, M. et al., 2009. miRanalyzer: a microRNA detection and analysis tool for next-generation sequencing experiments. Nucleic Acids Research, 37(Web Server), pp.W68–W76. - Friedländer, M.R. et al., 2012. miRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades. Nucleic acids research, 40(1), pp.37–52. - Shi, J. et al., 2015. mirPRo–a novel standalone program for differential expression and variation analysis of miRNAs. Scientific Reports, 5, p.14617. - Xue, C. et al., 2005. Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC bioinformatics, 6, p.310. - Lopes, I.D.O.N. et al., 2014. The discriminant power of RNA features for pre-miRNA recognition. BMC bioinformatics, 15(1), p.124.