Racon version 1.0.0 release. Finally!
by Robert Vaser
The consensus caller Racon (Vaser et al., 2017), which was published last year in Genome Research (https://genome.cshlp.org/content/27/5/737), finally got an update. We fixed a bunch of bugs, added some new features and pushed the new version to the master branch on our GitHub page as v1.0.0 (https://github.com/isovic/racon).
The tl;dr version of this post is as follows: Racon v1.0.0 is on average 1.6x faster, consumes 2.7x less memory, supports Illumina reads in both modes (error correction and consensus) and can be run without quality values (direct FASTA support, also in both modes). The following paragraphs address each of the new additions.
Speedup and memory consumption
The main speedup comes from the SPOA update, as SPOA is the core tool of Racon. We removed Gotoh alignment, implemented prefix max vectorization for row updates, added AVX2 support (which is only marginally faster than SSE4.1 due to higher shift latencies) and refactored the code so that alignment matrices are allocated only once per thread (so-called alignment engines). Additionally, the whole Racon code was refactored in order to decrease memory consumption and to make a later extension to HPC systems easier. Roughly, the memory consumption for polishing equals the size of the three mandatory input files, while it is somewhat larger for error correction (due to more overlaps being used per read).
We tested the new version on a large number of datasets (mostly bacteria) from both PB and ONT sequencers. The reads were assembled with miniasm (Li, 2016) and the assemblies were polished with two iterations of both old and new Racon. Figure 1 shows the execution time comparison, with an average speedup factor of 1.6, while figure 2 shows the difference in memory consumption, with an average decrease factor of 2.7. The difference in accuracy between the two versions, measured with Dnadiff (Delcher et al., 2003), is around 0.01% on average.
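To give a feeling for what a prefix max row update means, here is a minimal Python sketch. It is not SPOA's code: it assumes plain sequence-to-sequence global alignment with a linear gap penalty (SPOA aligns to a graph and uses SIMD instructions), and the function names and score values are made up for illustration. The point is that the left-to-right dependency inside a row can be replaced by a running maximum, which is the kind of operation that vectorizes well.

```python
from itertools import accumulate


def row_update_sequential(prev, scores, gap):
    """Reference version: every cell waits for its left neighbour."""
    row = [prev[0] + gap]
    for j in range(1, len(prev)):
        best = max(prev[j - 1] + scores[j - 1], prev[j] + gap)  # diagonal, vertical
        row.append(max(best, row[j - 1] + gap))                 # horizontal
    return row


def row_update_prefix_max(prev, scores, gap):
    """Same row, but the horizontal moves are resolved with a prefix max."""
    # Step 1: diagonal and vertical moves only; cells are independent here.
    tmp = [prev[0] + gap] + [
        max(prev[j - 1] + scores[j - 1], prev[j] + gap)
        for j in range(1, len(prev))
    ]
    # Step 2: row[j] = max over k <= j of tmp[k] + (j - k) * gap
    #                = j * gap + prefix_max(tmp[k] - k * gap)
    shifted = [t - k * gap for k, t in enumerate(tmp)]
    running = accumulate(shifted, max)
    return [m + j * gap for j, m in enumerate(running)]


# Illustrative values (assumptions): gap = -2, match = +1, mismatch = -1.
prev_row = [0, -2, -4, -6]   # first row of a global alignment matrix
scores = [1, -1, 1]          # substitution scores of the current row
print(row_update_prefix_max(prev_row, scores, -2))
```

Both functions return the same row; only the second one has a form in which step 1 and the final shift are independent per cell and step 2 is a single scan.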
Figure 1 Comparison of CPU execution time between Racon versions (run on 12 threads with two Intel® Xeon® CPUs E5645 @2.4GHz).
Figure 2 Comparison of memory consumption between Racon versions (run on 12 threads with two Intel® Xeon® CPUs E5645 @2.4GHz).
Furthermore, we tested error correction on an ONT Escherichia coli dataset. We noted that the old version of Racon did not duplicate overlaps if minimap (or any other mapper) was run without the duplicate overlap parameter. Therefore, some of the reads did not get full coverage. The new version duplicates the overlaps after alignment (so the mapper should not output them!) and yields better results, which are shown in table 1.
Table 1 Error correction comparison for ONT E. coli dataset.
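The overlap duplication mentioned above can be pictured with a small Python sketch. It is a simplification under our own assumptions, not the Racon implementation: overlaps are PAF-like records with coordinates given on the forward strand of each read, and the `Overlap` type and `duplicate_overlaps` function are hypothetical names. Each overlap is emitted a second time with query and target swapped, so both reads of a pair receive coverage when they act as the target.

```python
from typing import List, NamedTuple


class Overlap(NamedTuple):
    # PAF-like fields (assumed layout, forward-strand coordinates)
    q_name: str
    q_begin: int
    q_end: int
    t_name: str
    t_begin: int
    t_end: int
    strand: str


def duplicate_overlaps(overlaps: List[Overlap]) -> List[Overlap]:
    """Return the overlaps plus a query/target-swapped copy of each one."""
    result = []
    for o in overlaps:
        result.append(o)
        if o.q_name != o.t_name:  # a self-overlap has no meaningful mirror
            result.append(Overlap(o.t_name, o.t_begin, o.t_end,
                                  o.q_name, o.q_begin, o.q_end, o.strand))
    return result


overlaps = [Overlap("read1", 0, 100, "read2", 50, 150, "+")]
for o in duplicate_overlaps(overlaps):
    print(o.q_name, "->", o.t_name)
```

Without the mirrored copy, only read2 would be covered in this toy input; with it, read1 is covered as well.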
Illumina reads are detected automatically from the average read length (if it is smaller than 1000bp, the reads are treated as Illumina). You can use any short-read mapper (we tested minimap2 –sr2 (Li, 2017)) to map them to your assembly, which should already be polished with at least 1 round of long reads. We compared Racon Illumina polishing with Pilon (Walker et al., 2014) on a couple of ONT datasets. First we assembled the genomes with miniasm and polished them with Racon (1 or 2 iterations). Afterwards, we ran 1 iteration with Illumina data. Results are shown in figures 3, 4 and 5. Racon is a bit slower than Pilon, but uses less memory and yields higher accuracy (we hope we ran Pilon the proper way with 'java -Xmx16G -jar ./pilon-1.22.jar --threads 12 --genome <assembly> --bam <alignments> --outdir <.> --output <out>').
Figure 3 Comparison of CPU execution time between Racon and Pilon for Illumina polishing (on two Intel® Xeon® CPUs E5645 @2.4GHz with 12 threads).
Figure 4 Comparison of memory consumption between Racon and Pilon for Illumina polishing (on two Intel® Xeon® CPUs E5645 @2.4GHz with 12 threads).
Figure 5 Accuracy difference between Racon and Pilon.
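The automatic Illumina detection described above boils down to a simple check on the average read length. A minimal Python sketch of that check could look like this; the function name and the example lengths are hypothetical, and only the 1000bp threshold comes from the text.

```python
def looks_like_illumina(read_lengths, threshold=1000):
    """Return True if the average read length is below the threshold (in bp)."""
    if not read_lengths:
        raise ValueError("no reads given")
    average = sum(read_lengths) / len(read_lengths)
    return average < threshold


# Made-up read lengths for illustration only.
print(looks_like_illumina([150, 150, 151]))      # short reads
print(looks_like_illumina([8000, 12000, 500]))   # long reads
```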
The new version of Racon can be run without quality values, i.e. you can now pass reads in FASTA files. It automatically detects the format from the file extension, if the extension is one of the following: .fa, .fasta, .fq, .fastq for reads and .paf, .mhap, .sam for overlaps. We might add other formats like GFA or BAM in later versions.
We will soon be adding a wrapper script for running Racon in error correction mode when the input files are extremely large (the script will split the reads and combine the results). We would also like to add support for HPC (most probably Open MPI) and are currently researching further speedups at the cost of a tiny bit of accuracy.
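Extension-based format detection of this kind can be sketched in a few lines of Python. This is an illustration, not Racon's code: the extension lists are the ones given above, while the function name, the `kind` argument and the returned labels are our own assumptions.

```python
import os

# Extensions as listed in the text; the format labels are illustrative.
READ_FORMATS = {".fa": "fasta", ".fasta": "fasta", ".fq": "fastq", ".fastq": "fastq"}
OVERLAP_FORMATS = {".paf": "paf", ".mhap": "mhap", ".sam": "sam"}


def detect_format(path, kind):
    """Map a file extension to a format label; kind is 'reads' or 'overlaps'."""
    table = READ_FORMATS if kind == "reads" else OVERLAP_FORMATS
    extension = os.path.splitext(path)[1]
    if extension not in table:
        raise ValueError("unsupported extension: " + repr(extension))
    return table[extension]


print(detect_format("reads.fasta", "reads"))
print(detect_format("overlaps.paf", "overlaps"))
```

Note that this sketch only looks at the last extension and matches it exactly, so anything outside the listed extensions is rejected.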
Datasets for comparison between new and old Racon versions consist of:
· 377 PB bacteria datasets (http://www.sanger.ac.uk/resources/downloads/bacteria/nctc/), full list obtainable via email!
· 12 ONT Klebsiella datasets (Wick et al., 2017)
· 1 ONT Escherichia dataset (http://lab.loman.net/2015/09/24/first-sqk-map-006-experiment/)
· 1 PB Escherichia dataset (https://github.com/PacificBiosciences/DevNet/wiki/E.-coli-Bacterial-Assembly)
· 2 ONT Saccharomyces datasets (http://www.genoscope.cns.fr/externe/Download/Projets/yeast/datasets/raw_data/S288C)
· 1 PB Saccharomyces dataset (https://github.com/PacificBiosciences/DevNet/wiki/Saccharomyces-cerevisiae-W303-Assembly-Contigs)
For error correction we used the single ONT Escherichia dataset. For Illumina polishing, we used the 12 ONT Klebsiella datasets (the dataset number in the figures represents its barcode) and the 12 corresponding Illumina datasets (from the same source). Reference genomes were obtained from https://www.ncbi.nlm.nih.gov/genomes/browse/.
References
- Delcher,A.L. et al. (2003) Using MUMmer to Identify Similar Regions in Large Sequence Sets. Curr. Protoc. Bioinforma., 1–18.
- Li,H. (2017) Minimap2: fast pairwise alignment for long nucleotide sequences.
- Li,H. (2016) Minimap and miniasm: Fast mapping and de novo assembly for noisy long sequences. Bioinformatics, 32, 2103–2110.
- Vaser,R. et al. (2017) Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res., 27, 737–746.
- Walker,B.J. et al. (2014) Pilon: An integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One, 9.
- Wick,R.R. et al. (2017) Completing bacterial genome assemblies with multiplex MinION sequencing. Microb. Genomics, 1–7.