Racon version 1.0.0 release. Finally!
by Robert Vaser
The consensus caller Racon (Vaser et al., 2017), which was published last year in Genome Research (https://genome.cshlp.org/content/27/5/737), finally got an update. We fixed a bunch of bugs, added some new features and pushed the new version to master branch on our Github page as v1.0.0 (https://github.com/isovic/racon).
The tl;dr version of this post is as follows: Racon v1.0.0 is on average 1.6x faster, consumes 2.7x less memory, supports Illumina in both modes (error correction and consensus) and can be run without quality values (direct FASTA support, also both modes). Following paragraphs will address each of the new additions.
Speedup and memory consumption
The main speedup comes from SPOA update as it is the core tool of Racon. We remove Gotoh alignment, implemented prefix max vectorization for row updates, added AVX2 support (which is marginally faster than SSE4.1 due to higher shift latencies) and refactored the code so that alignment matrices are allocated only once per thread (so called alignment engines). Additionally, the whole Racon code was refactored in order to decrease the memory consumption and enable easier update to HPC systems. Roughly, the memory consumption for polishing equals the size of the three mandatory input files, while the memory is somewhat larger for error correction (due to more overlaps being used per read).
We tested the new version on a large amount of datasets (mostly bacteria) from both PB and ONT sequencers. The reads were assembled with miniasm (Li, 2016) and assemblies were polished with two iterations of both old and new Racon. Figure 1 shows execution time comparison, with average speedup factor of 1.6, while figure 2 shows memory consumption difference, with average decrease factor of 2.7. The difference in accuracy between the two versions is around 0.01% at average which was measured with Dnadiff (Delcher et al., 2003).
Figure 1 Comparison of CPU execution time between Racon versions (run on 12 threads with two Intel® Xeon® CPUs E5645 @2.4GHz).
Figure 2 Comparison of memory consumption between Racon versions (run on 12 threads with two Intel® Xeon® CPUs E5645 @2.4GHz).
Furthermore, we tested error correction on a ONT Escherichia coli dataset. We noted that the old version of Racon did not duplicate overlaps if minimap (or any other mapper) was run without duplicate overlap parameter. Therefore, some of the reads did not get full coverage. The new version duplicates the overlaps after alignment (so the mapper should not output them!) and yields better results which are shown in table 1.
Table 1 Error correction comparison for ONT E.coli dataset.
Illumina reads are detected automatically by calculating the average length of reads (if it is smaller than 1000bp). You can use any short-read mapper (we tested minimap2 –sr2 (Li, 2017)) to map them to your assembly which was polished with at least 1 round of long reads. We compared Racon illumina polishing on a couple of ONT datasets with Pilon (Walker et al., 2014). First we assembled the genomes with miniasm and polished them with Racon (1 or 2 iterations). Afterwards, we used 1 iteration with Illumina data. Results are shown in figures 3, 4 and 5. Racon is a bit slower than Pilon, but uses less memory and yields higher accuracy (we hope we ran Pilon the proper way with 'java -Xmx16G -jar ./pilon-1.22.jar --threads 12 --genome <assembly> --bam <alignments> --outdir <.> --output<out>).
Figure 3 Comparison of CPU execution time between Racon and Pilon for Illumina polishing (on two Intel® Xeon® CPUs E5645 @2.4GHz with 12 threads).
Figure 4 Comparison of memory consumption between Racon and Pilon for Illumina polishing (on two Intel® Xeon® CPUs E5645 @2.4GHz with 12 threads).
Figure 5 Accuracy difference between Racon and Pilon.
New version of Racon can be run without quality values, i.e. you can pass reads in FASTA files now. It will automatically detect the format from the file extension, if the extension is one of the following: .fa, .fasta, .fq, .fastq for reads and .paf, .mhap, .sam for overlaps. We might add other formats like GFA or BAM in the later versions.
We will soon be adding a wrapper script for running Racon in error correction mode when the input files are extremely large (the script will split the reads and combine results). We would like to add support for HPC (most probably openMPI) and are currently researching further speedups with a tiny bit of accuracy degradation.
Datasets for comparison between new and old Racon versions consist of:
· 377 PB bacteria datasets (http://www.sanger.ac.uk/resources/downloads/bacteria/nctc/), full list obtainable via email!
· 12 ONT Klebsiella datasets (Wick et al., 2017)
· 1 ONT Escherichia dataset (http://lab.loman.net/2015/09/24/first-sqk-map-006-experiment/)
· 1 PB Escherichia dataset (https://github.com/PacificBiosciences/DevNet/wiki/E.-coli-Bacterial-Assembly)
· 2 ONT Saccharomyces datasets (http://www.genoscope.cns.fr/externe/Download/Projets/yeast/datasets/raw_data/S288C)
· 1 PB Saccharomyces dataset (https://github.com/PacificBiosciences/DevNet/wiki/Saccharomyces-cerevisiae-W303-Assembly-Contigs)
For error correction we used the single ONT Escherichia dataset. For Illumina polishing, we used 12 ONT Klebsiella datasets (dataset number in figures represent its barcode) and 12 corresponding Illumina datasets (from the same source). Reference genomes were obtained from https://www.ncbi.nlm.nih.gov/genomes/browse/.
- Delcher,A.L. et al. (2003) Using MUMmer to Identify Similar Regions in Large Sequence Sets. Curr. Protoc. Bioinforma., 1–18.
- Li,H. (2017) Minimap2: fast pairwise alignment for long nucleotide sequences.
- Li,H. (2016) Minimap and miniasm: Fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32, 2103–2110.
- Vaser,R. et al. (2017) Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res., 27, 737–746.
- Walker,B.J. et al. (2014) Pilon: An integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One, 9.
- Wick,R.R. et al. (2017) Completing bacterial genome assemblies with multiplex MinION sequencing. Microb. Genomics, 1–7.
Long-read RNA-seq mapping - very close competition between GMAP and Minimap2
by Krešimir Križanović
Recently, our paper on RNA mappers for 3rd generation sequencing data has been published by Bioinformatics (https://doi.org/10.1093/bioinformatics/btx668). During the work on the paper, Minimap2 (https://github.com/lh3/minimap2) did not have support for RNA mapping and there was not enough time to include it in the tests during the revision process. Recently, we have run Minimap2 on our test dataset and the results (compared to other RNA mapping tools that we tested in our paper) are given in the tables below.
Full results for all tested mappers, the details on the used tools, datasets and metrics can be found in the paper (link above) and in the GitHub repository for the tools that we developed for the evaluation (https://github.com/kkrizanovic/RNAseqEval). All real and simulated datasets, as well as all of the data used for dataset simulation can be found at FigShare (https://figshare.com/projects/RNAseq_benchmark/24391).
Our test datasets include 4 simulated and 4 real datasets of varying complexity. The details are given in the table below. In our evaluation, we used different metrics for synthetic and for real datasets. For synthetic datasets, origin of each simulated read is precisely known (generated by the simulator). Therefore, it is possible to precisely evaluate mapping precision for each read. Three simulated datasets were simulated with the same error profile (for PacBio ROI), while one was simulated with ONT MinION error profile. For real datasets, read origins are unknown, and mapping is evaluated by comparing it to the set of annotations. We cannot determine with certainty if a read is mapped correctly, we can only check if it overlaps an exon or several exons in a series, corresponding to an annotation. All real datasets we obtained from the same organism, but have different error profiles, due to different technologies, error correction or read type.
Table 1. Test datasets.
The results show Minimap2 achieves very good results for all datasets, similar to GMAP (usually within few percent), the mapper that proved the best in our tests for the paper. It maps more reads than other mappers on all simulated and real datasets (row Aligned), or at least it reports the most reads as mapped.
The results on first three simulated datasets (PacBio ROI error profile) show that Minimap2 is always the best at mapping to at least one exon of the read origin. On datasets one and two (less complex, having less reads generated from more than one exon), Minimap2 also maps reads the best to all exons from their origin. On the most complex dataset three, GMAP is ahead in “hitting” all exons. Except on the least complex dataset one, Minimap2 falls behind GMAP at correctly mapping (within 5 bases) to the beginning, end and all exon boundaries of the read origin. Simulated dataset four was simulated using ONT MinION error profile and GMAP and Minimap2 show very similar results on all measures.
The results on real datasets are also very close between GMAP and Minimap2. On higher error rates, datasets three and four (PacBio subreads and ONT MinION 2D reads), Minimap2 is clearly better on all criteria. However, on lower error rates, datasets one and two (PacBio ROI and error-corrected PacBio ROI) Minimap2 reports more reads as mapped and is still better at hitting an exon from an annotation file. However, GMAP shows better results at mapping to a contiguous set of exons from an annotation file.
To summarize, the results achieved by Minimap2 are very good and very close to GMAP. It can be speculated that Minimap2 handles higher error rates better and is better at mapping to a general read origin. However, GMAP is slightly better at mapping to a precise read origin, especially at lower error rates. This hypothesis would, of course, have to be tested further, to be confirmed.
Table 2 (Modified table 3 in the paper). Evaluation on synthetic datasets.
All results are displayed as the percentage of all reads in the dataset. The percentages of reads that were aligned is shown (without assessing the accuracy), percentage of reads for which the beginning, the end and inner exon boundaries are accurately placed within 5 base-pairs (Correct), percentage of reads that overlap all exons of the read origin (Hit all) and percentage of reads that overlap at least one exon of the read origin (Hit one). Overlaps of hit one and hit all statistics need to be at least 5 bases.
Table 3 (Modified table 4 in the paper). Aligner evaluation on real datasets.
The table shows percentage of reads that were aligned (without assessing the accuracy), percentage of reads that overlap at least one exon (exon hit) and percentage of reads that overlap one or more exons in a sequence, corresponding to a gene annotation (contiguous exon alignment). All values are displayed as the percentage of all reads in the dataset. Overlaps for exon hit statistics need to be at least 5 bases.
In our initial tests we didn’t consider whether a read represents a genuine part of an RNA, or is a sequencing artefact. Our only concern was whether an aligner was able to map a read to a location that we considered accurate. Due to quite busy schedule and holidays, we finally found some time to do some additional testing.
Minimap2 reports a significant number of reads in each real dataset as chimeric. This information was not provided by other aligners (with the parameters that we initially used). To better asses how GMAP compares to Minimap2 on genuine reads, we have removed all reads reported as chimeric from each real test dataset, and run our evaluation again. The results are shown in table 4. We have also rerun GMAP on real datasets with different parameters to see how many reads it reports as chimeric. The number of reads each aligner reports as chimeric is given in table 5.
Table 4 (Modified table 4 in the paper). Aligner evaluation on real datasets without chimeric reads.
Table 5. Number of reads reported as chimeric.
We can see that the results in table 4 differ from table 3 only slightly. Minimap2 in general achieves few fractions of a percent worse results than in table 3. This is due to the fact that Minimap2 was able to map all chimeric reads (Minimap2 was used to determine chimeric reads). GMAP, on the other hand, achieves slightly better results on dataset 1, while slightly worse results on other datasets. It can be concluded that GMAP was able to map fewer chimeric reads on dataset 1, and with excluding those reads from the dataset, GMAPs results improved. On other datasets GMAP was able to map almost all chimeric reads (close to Minimap2), and with excluding those reads from the datasets, GMAPs results worsened.
In table 5 we can see that in general PacBio datasets (5, 6 and 7) contain significantly more chimeric reads than ONT MinION dataset. This is due to the fact that PacBio sequencing instruments produce more artefacts. It is also interesting to note that GMAP reports chimeric reads somewhat inconsistently, with relatively few on ROI and corrected ROI datasets (datasets 5 and 6 respectively), and with much more on the subreads dataset (dataset 7).