Long-read RNA-seq mapping - very close competition between GMAP and Minimap2
by Krešimir Križanović
Our paper on RNA mappers for third-generation sequencing data was recently published in Bioinformatics (https://doi.org/10.1093/bioinformatics/btx668). While we were working on the paper, Minimap2 (https://github.com/lh3/minimap2) did not yet support RNA mapping, and there was not enough time to include it in the tests during the revision process. We have since run Minimap2 on our test datasets, and the results (compared to the other RNA mapping tools tested in our paper) are given in the tables below.
Full results for all tested mappers, along with details on the tools, datasets and metrics used, can be found in the paper (link above) and in the GitHub repository of the tools that we developed for the evaluation (https://github.com/kkrizanovic/RNAseqEval). All real and simulated datasets, as well as all of the data used for dataset simulation, can be found on FigShare (https://figshare.com/projects/RNAseq_benchmark/24391).
Our test datasets include 4 simulated and 4 real datasets of varying complexity. The details are given in the table below. In our evaluation, we used different metrics for synthetic and for real datasets. For synthetic datasets, the origin of each simulated read is precisely known (it is generated by the simulator), so it is possible to evaluate the mapping accuracy of each read exactly. Three of the simulated datasets were simulated with the same error profile (PacBio ROI), while one was simulated with an ONT MinION error profile. For real datasets, read origins are unknown, and mapping is evaluated by comparing it to a set of annotations. We cannot determine with certainty whether a read is mapped correctly; we can only check whether it overlaps an exon, or several exons in a series, corresponding to an annotation. All real datasets were obtained from the same organism, but they have different error profiles due to different technologies, error correction or read type.
Table 1. Test datasets.
The results show that Minimap2 performs very well on all datasets, similar to GMAP (usually within a few percent), the mapper that performed best in our tests for the paper. Minimap2 maps more reads than the other mappers on all simulated and real datasets (row Aligned), or at least it reports the most reads as mapped.
The results on the first three simulated datasets (PacBio ROI error profile) show that Minimap2 is always the best at mapping to at least one exon of the read origin. On datasets one and two (less complex, with fewer reads generated from more than one exon), Minimap2 is also the best at mapping reads to all exons of their origin. On the most complex dataset, dataset three, GMAP is ahead in "hitting" all exons. Except on the least complex dataset one, Minimap2 falls behind GMAP at correctly mapping (within 5 bases) the beginning, the end and all exon boundaries of the read origin. Simulated dataset four was simulated using an ONT MinION error profile, and GMAP and Minimap2 show very similar results on all measures.
The results on real datasets are also very close between GMAP and Minimap2. At higher error rates, on datasets three and four (PacBio subreads and ONT MinION 2D reads), Minimap2 is clearly better on all criteria. At lower error rates, on datasets one and two (PacBio ROI and error-corrected PacBio ROI), Minimap2 still reports more reads as mapped and is better at hitting an exon from an annotation file, but GMAP shows better results at mapping to a contiguous set of exons from an annotation file.
To summarize, the results achieved by Minimap2 are very good and very close to those of GMAP. It can be speculated that Minimap2 handles higher error rates better and is better at mapping to the general region of a read's origin, while GMAP is slightly better at mapping to the precise read origin, especially at lower error rates. This hypothesis would, of course, need further testing to be confirmed.
Table 2 (Modified table 3 in the paper). Evaluation on synthetic datasets.
All results are displayed as the percentage of all reads in the dataset. The table shows the percentage of reads that were aligned (without assessing the accuracy), the percentage of reads for which the beginning, the end and the inner exon boundaries are accurately placed within 5 base pairs (Correct), the percentage of reads that overlap all exons of the read origin (Hit all) and the percentage of reads that overlap at least one exon of the read origin (Hit one). For the Hit one and Hit all statistics, an overlap needs to be at least 5 bases long.
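To make the three per-read measures concrete, here is a minimal Python sketch of how they could be computed for a single simulated read. It is an illustration under stated assumptions, not the RNAseqEval implementation: exons and aligned blocks are represented as half-open (start, end) intervals on the same chromosome and strand, the function names are hypothetical, and "Correct" is assumed to require the same number of aligned blocks as origin exons.

```python
from typing import List, Tuple

# Half-open [start, end) interval; this representation is an assumption of the sketch.
Interval = Tuple[int, int]


def overlap(a: Interval, b: Interval) -> int:
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def hit_one(aligned: List[Interval], origin: List[Interval], min_overlap: int = 5) -> bool:
    """True if the alignment overlaps at least one origin exon by min_overlap bases."""
    return any(overlap(a, o) >= min_overlap for a in aligned for o in origin)


def hit_all(aligned: List[Interval], origin: List[Interval], min_overlap: int = 5) -> bool:
    """True if every origin exon is overlapped by some aligned block."""
    return all(any(overlap(a, o) >= min_overlap for a in aligned) for o in origin)


def correct(aligned: List[Interval], origin: List[Interval], tolerance: int = 5) -> bool:
    """True if every block boundary lies within `tolerance` bases of the origin.

    Assumption: the alignment must have exactly as many blocks as the origin has exons.
    """
    if len(aligned) != len(origin):
        return False
    return all(
        abs(a[0] - o[0]) <= tolerance and abs(a[1] - o[1]) <= tolerance
        for a, o in zip(sorted(aligned), sorted(origin))
    )


# Hypothetical read simulated from three exons.
origin = [(100, 200), (300, 400), (500, 600)]
good_alignment = [(102, 200), (300, 400), (500, 597)]  # all boundaries within 5 bases
partial_alignment = [(100, 200), (300, 390)]           # last exon missed entirely
```

Under these assumptions, `good_alignment` counts towards Correct, Hit all and Hit one, while `partial_alignment` counts only towards Hit one. The percentages in the table would then be the number of reads passing each check divided by the total number of reads in the dataset.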
Table 3 (Modified table 4 in the paper). Aligner evaluation on real datasets.
The table shows the percentage of reads that were aligned (without assessing the accuracy), the percentage of reads that overlap at least one exon (exon hit) and the percentage of reads that overlap one or more exons in a sequence, corresponding to a gene annotation (contiguous exon alignment). All values are displayed as the percentage of all reads in the dataset. For the exon hit statistic, an overlap needs to be at least 5 bases long.
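The two annotation-based measures can be sketched in the same way. The Python below is a rough illustration, not the RNAseqEval code: the function names are hypothetical, intervals are half-open (start, end) pairs, and "contiguous exon alignment" is interpreted here as every aligned block overlapping exactly one exon of a single annotated transcript, with the overlapped exons forming an uninterrupted run. Applying the 5-base minimum to the contiguous check is also an assumption.

```python
from typing import List, Tuple

# Half-open [start, end) interval; this representation is an assumption of the sketch.
Interval = Tuple[int, int]


def overlap(a: Interval, b: Interval) -> int:
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def exon_hit(aligned: List[Interval], exons: List[Interval], min_overlap: int = 5) -> bool:
    """True if any aligned block overlaps any annotated exon by min_overlap bases."""
    return any(overlap(a, e) >= min_overlap for a in aligned for e in exons)


def contiguous_exon_alignment(
    aligned: List[Interval], transcript_exons: List[Interval], min_overlap: int = 5
) -> bool:
    """True if the aligned blocks cover a run of consecutive exons of one transcript.

    Assumed interpretation: each block overlaps exactly one exon, and the
    exon indices increase by one from block to block (no exon is skipped).
    """
    exons = sorted(transcript_exons)
    indices = []
    for block in sorted(aligned):
        hits = [i for i, e in enumerate(exons) if overlap(block, e) >= min_overlap]
        if len(hits) != 1:
            return False
        indices.append(hits[0])
    return len(indices) > 0 and all(b - a == 1 for a, b in zip(indices, indices[1:]))


# Hypothetical annotated transcript with three exons.
transcript = [(100, 200), (300, 400), (500, 600)]
consecutive = [(150, 200), (300, 400)]  # exons 1 and 2, in order
skipping = [(150, 200), (500, 560)]     # exon 2 is skipped
```

With this interpretation, `consecutive` counts as both an exon hit and a contiguous exon alignment, while `skipping` counts only as an exon hit. In practice a read would be checked against every annotated transcript overlapping its mapped position, which this sketch leaves out.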