Mem Inst Oswaldo Cruz, Rio de Janeiro, VOLUME 114 | JAN 2019
Complete and de novo assembly of the Leishmania braziliensis (M2904) genome
Universidad Autónoma de Madrid, Centro de Biología Molecular Severo Ochoa, Madrid, Spain
Leishmania braziliensis is the etiological agent of American mucosal leishmaniasis, one of the most severe clinical forms of leishmaniasis. Here, we report the assembly of the L. braziliensis (M2904) genome into 35 continuous chromosomes. Also, the annotation of 8395 genes is provided. The public availability of this information will contribute to a better knowledge of this pathogen and help in the search for vaccines and novel drug targets aimed to control the disease caused by this Leishmania species.
Protists of the genus Leishmania are causative agents of a group of diseases known as leishmaniasis, which range from self-curing cutaneous lesions to mucosal disfigurations and fatal visceral outcomes.(1) According to estimates of incidence and disease burden, leishmaniasis are among the six world’s most serious parasitic diseases.(2) The medical relevance of this parasite led to push the project of deciphering its genome sequence that was undertaken by an international consortium formed by the Wellcome Sanger Institute and many laboratories around the world. As a result, the Leishmania (L.) major genome sequence was published in 2005,(3) providing an invaluable framework for studies in very different areas of the field.(4) Soon after, as L. major causes in humans a relatively benign form of cutaneous leishmaniasis, representative species of the two other forms of leishmaniasis, mucosal (Leishmania braziliensis) and visceral (Leishmania infantum)were also selected for genome sequencing projects.(5) The genome sequence of another relevant species, Leishmania donovani,was published a few years later.(6) Nevertheless, the genome assemblies achieved for these species were of a lower quality than that obtained for the L. major genome; many sequence gaps remained in the chromosomal scaffolds generated. Despite the advances in high-throughput sequencing methods and the relatively small genome size of this species (around 32-Mbp, per haploid set), the abundance in repeated sequences (0.4-1 kb in length) that are scattered along the different chromosomes(7, 8) is a hurdle for sequence assemblers. In fact, most of the recent genomic assemblies reported for Leishmania species contain a significant number of sequence gaps and uncertainties in the structure of chromosomes.
Fortunately, third-generation sequencing technologies, having the capacity of generating long sequence reads, have emerged as valuable tools for solving many of the assembly problems arising with the use of short sequence reads.(9) In particular, single molecule real time (SMRT) sequencing, developed by Pacific Biosciences (PacBio(10)), is being incorporated as a necessary methodology for generating complete chromosomal assemblies. Recently, PacBio sequencing was used to generate a new reference genome for L. donovani.(11) Similarly, a complete L. infantum genome was obtained by our group(12) through a combination of PacBio long reads and paired-end short-reads generated by Illumina technology. In the present work, following the pipeline developed for the assembly of L. infantum, we present a complete assembly of the L. braziliensis (M2904 strain) genome into 35 chromosomes, providing the corresponding gene annotation.
In this study, the M2904 strain (MHOM/BR/75/M2904) of L. braziliensis was used, as it is the common reference for genomic studies.(5, 13) This strain was kindly provided by Dr J Moreno (WHO Collaborating Centre for Leishmaniasis, Centro Nacional de Microbiología, Instituto de Salud Carlos IIII, Madrid, Spain). Culturing conditions and the procedure for total DNA isolation have been described elsewhere.(14)
Firstly, Illumina sequencing of total DNA was carried out. Library construction and paired-end library sequencing were performed at the Centro Nacional de Análisis Genómico (CNAG-CRG, Spain; http://www.cnag.crg.eu/) using Illumina HiSeq2000 technology. A total of 51,049,711 paired-end, 2×126 bp sequence reads were generated with a mean insert size of 301 bp. PrinseqQuality (http://prinseq.sourceforge.net/) was applied to quality filtering/trimming of reads (cut-off value of 20, quality phred score), and only reads with length ≥ 60-nt were used (Fig. 1). Finally, 50,619,063 filtered reads were assembled using the CLC Genomics Workbench version 5.0 (CLC Bio). However, the results were not satisfactory as the genome was assembled into 2,686 scaffolds with 3,888 sequence gaps remaining.
To improve the assembly, PacBio sequencing was undertaken. A total of 276,308 pre-filtered reads with a median size of 16,600 nucleotides were generated on a PacBio RS II sequencing instrument. The Norwegian Sequencing Centre (www.sequencing.uio.no) provided the sequencing service. As outlined in Fig. 1, firstly, quality trimming of PacBio reads was performed by default in the HGAP pipeline (P_filter module). De novo genome assembly was carried out following a hierarchical genome-assembly process (HGAP(15)), using the HGAP3 (Pacific Biosciences, SMRT Analysis Software v2.3.0) and HGAP4 (Pacific Biosciences, SMRT Link 4.0.0) tools. Three different assembly strategies were used by varying the software tool and the size of the expected genome: 34 and 35 Mbp, using HGAP3, and 35 Mbp using HGAP4; as a result, 162, 151 and 90 contigs were obtained, respectively. The HGAP4-generated contigs were selected for further analysis. Forty-four out of the 90 contigs were determined to be spurious and discarded, giving their disproportionately low coverage (mean < 40×) and/or short length (< 15-Kb). Some of these contigs corresponded to maxicircle sequences, even though a complete sequence for the L. braziliensis maxicircle molecule could not be generated. The rest of the discarded contigs showed a chimeric structure, being composed by Leishmania sequences present in two or more regions of the chromosomal-size contigs, probably due to assembly artefacts.
The remaining 46 contigs (see Fig. 1), having mean coverage of 96× (Table), were preliminary assigned to chromosomes by BLAST analysis(16) using the L. major (Friedlin strain) chromosomes as reference.(17) This analysis indicated that 27 of the contigs represented complete L. braziliensis chromosomes. The remaining 19 contigs could be ordered into continuous chromosomes without any sequence gaps (Fig. 2) using different assemblers. In particular, SSPACE-standard [version 2.0 with extension option (-x 1)] software(18) was useful to join the contigs belonging to chromosomes 8 and 9; SSPACE-LongReads (v 1-1),(19) a tool which uses only the longest PacBio subreads, was successful for joining the two contigs forming the chromosome 27. The two contigs of the chromosome 10 were merged with pyScaf (v1) software (https://github.com/lpryszcz/pyScaf), which aligns long reads onto the contigs and identifies the reads that connect them. Three contigs were identified for chromosome 34, and joined by the combination of two tools: contig1 and contig2 could be joined by pyScaf, and the union between contig 2 and contig 3 was attained by minimus2 (v2),(20) which uses NUCmer (v3.1)(21) to compute overlaps between contigs. Finally, the contigs belonging to the three remaining chromosomes (2, 20 and 22) were joined with minimus2.
Afterwards, the accuracy of the joining points was evaluated using the Illumina paired-end reads. The gap-size in each chromosome was calculated and closed by combining the results generated with the tools GapCloser (v1.12-r6)(22) and Gapfiller.(23) GapFiller (v1.10) was run with the parameters -g 3 (three allowed gaps in Bowtie alignment) and -i 100 (number of iterations to fill a gap). Additionally, the contigs generated from the Illumina sequencing reads (see above) were aligned to the de novo assembled chromosomes using LAST aligner (v.8, http://last.cbrc.jp/) with -uNEAR option. We found nine Illumina contigs that aligned at the ends of chromosomes 8, 15, 16, 20, 21, 22, 23 and 27, but having overhanging sequences. Extension of these chromosomes was performed using MAFFT multiple-aligner (v.7.313). To extend the rest of chromosomes, the SSPACE-standard software was used.
Coverage analysis on the final chromosomal assemblies was performed using Illumina and PacBio reads [see Supplementary data (Figs 1-35)]. Illumina reads were aligned with Bowtie2 (v.2.34.3; option: --local), and PacBio bax.h5 reads were aligned with the pbalign (BLASR, v.0.2.0) tool. Coverage analysis was performed from each alignment along the 35 chromosomes using the GenomeCoverageBed tool (v.2.25.0, using -d option, which informs the depth at each genome position) of BEDtools2. The graphical coverage plots files were generated with GNUPLOT (v.4.6, patchlevel 2; http://www.gnuplot.info/). A continuous distribution of reads was found in all chromosomes, and the reads coverage in the eight chromosomes generated by joining of two or three contigs are shown in Fig. 2. The joining points of the chromosomes assembled initially in two or more contigs correspond to regions with tandemly repeated genes (or other repeated elements), and, according to the uneven coverage observed at these points in some of the chromosomes (Fig. 2), it is likely that the exact number of repeats has still not been determined.
At this point, a detailed sequence checking was performed using PacBio-utilities (v.1; https://github.com/douglasgscofield/PacBio-utilities), designed to correct deletions and insertions introduced with some frequency, mainly in homopolymer strings, by the PacBio sequencing. Corrections were dictated by the sequence derived from Illumina reads, which have higher accuracy than the PacBio ones. Sequence insertions/deletions were corrected when they were supported by more than ten Illumina reads and the indel was present in 80% (or above) of the reads mapping the concerned position. Additionally, a second correction step was carried out using an in-house Python script, which uses the output generated by the Pilon tool (v.1.2.2, using option: --diploid).(24) As a result, 556 positions were corrected, of which 161 corresponded to sequence deletions and 395 to insertions. The accuracy of these modifications was checked by visualisation of Illumina reads mapped to the assembled genome using the IGV (v.2.3) tool.(25)
Gene annotation of both protein-coding genes and known non-coding RNAs on the new assembled L. braziliensis genome was performed using Companion web server, using the default settings and selecting the L. major (Friedlin) annotation as a reference. Additionally, OrthoMCL web version(26) and BLAST searches were used to complement the annotations generated by Companion. Afterwards, all data were combined into a GFF3 file by an in-house Python script. Finally, a manual curation by IGV visualisation of the annotated genes was performed. The annotation file is provided as a supplementary dataset. As summarised in Table, a total of 8,395 genes have been annotated, of which 8,277 correspond to protein-coding genes (including 33 pseudogenes) and 118 are genes coding for structural and/or functional RNAs (i.e., rRNAs, tRNAs and snoRNAs).
To visually illustrate the changes in the chromosomal architecture found in the assembly reported here regarding the current reference genome (GeneDB), a comparison of chromosome 20 in both assemblies is shown in Fig. 3. In the current reference, this chromosome is assembled in two separate scaffolds, whereas a continuous chromosome was assembled in this work. Moreover, several regions were assembled in a different order. The availability of a robust genome sequence, as the one presented in this work, is a valuable resource for studies addressing whole-organism aspects following either genomics, transcriptomics or proteomics approaches. Nevertheless, to date, there is not any genome assembly that can be considered as set in stone, and the L. braziliensis genome reported here is not an exception. In particular, we are aware that the structure of the extremities for several of the assembled chromosomes has not been determined definitively, as the telomeric hexanucleotide repeats were not assembled for all the chromosomal ends. The Leishmania chromosomes contain subtelomeric regions consisting of repetitive sequences with variable length and number of repeats that are often chromosome-specific, and for these reasons dedicated methodological approaches are required for accurate determination of their structure.(27)
Data availability - Genomic raw data have been deposited in The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena/). Also, the assembled genome and annotations files were uploaded under the Study accession number PRJEB25922 and study unique name ena-STUDY-CBMSO-09-04-2018-08:02:47:223-4008. Additionally, the new L. braziliensis genome sequence and annotations are available at the Leish-ESP web site (http://leish-esp.cbm.uam.es/).
FCR, BA and JMR - Conceived and designed the experiments; JMR - parasites cultures and preparation of DNA; SGF, RPP and AR - genome assembly; SGF and EC - genome annotation; SGF, EC and JMR - drawing of figures, drafted the manuscript; JMR and BA edited the final manuscript. All authors contributed to the writing, have read and approved the final manuscript.