Next-generation sequencing (NGS) technology platforms have accelerated access to completed
genome assemblies. Recently, collaborators at Tygerberg Medical School outsourced the sequencing
of Mycobacterium orygis, a member of the Mycobacterium tuberculosis complex (MTC), to the
company Co-factor, which used SOLiD (Sequencing by Oligonucleotide Ligation and Detection) sequencing
technology. A total of 31,271,059 short reads were generated and required filtering,
assembly and annotation using bioinformatics algorithms. In this project, an NGS assembly pipeline
tailored specifically to SOLiD sequence data was implemented. The raw reads were aligned using
the NovoalignCS algorithm to seven fully sequenced and annotated MTC members, namely,
Mycobacterium tuberculosis H37Rv, H37Ra, CDC1551, F11, KZN 1435, Mycobacterium bovis
AF2122/97 and Mycobacterium bovis BCG str. Pasteur 1173P2. Depth and breadth of sequence
coverage across each base of the reference genome were calculated using BEDTools, a suite of
utilities for comparing genomic features. Variation at the nucleotide level, including
deletions, insertions and single nucleotide polymorphisms (SNPs), was called using three tools:
GATK, SAMtools and Nesoni. These variants were further filtered using in-house Perl scripts.
Putative functional roles for the alterations at the DNA level were extrapolated from the overlap
with essential genes present in annotated MTC members.
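
The coverage step described above can be illustrated with a minimal Python sketch. It assumes the per-base depths for one reference have already been read into a list (for example from a per-position depth table); the function names and the toy depth values are hypothetical and are not taken from the pipeline itself.

```python
def coverage_summary(depths):
    """Return (mean depth, breadth) for a list of per-base read depths.

    Breadth is the fraction of reference positions covered by at least one read.
    """
    if not depths:
        raise ValueError("depths must contain at least one position")
    covered = sum(1 for d in depths if d > 0)
    return sum(depths) / len(depths), covered / len(depths)


def uncovered_regions(depths):
    """Return 1-based inclusive (start, end) intervals where depth is zero."""
    regions = []
    start = None
    for pos, depth in enumerate(depths, start=1):
        if depth == 0 and start is None:
            start = pos                      # a zero-depth run begins here
        elif depth != 0 and start is not None:
            regions.append((start, pos - 1)) # the run ended at the previous base
            start = None
    if start is not None:                    # run extends to the end of the reference
        regions.append((start, len(depths)))
    return regions


# Toy per-base depths (hypothetical values, not real data).
toy_depths = [0, 0, 3, 5, 4, 0, 2, 0, 0, 0]
mean_depth, breadth = coverage_summary(toy_depths)
gaps = uncovered_regions(toy_depths)
```

In this sketch, runs of zero depth play the role of the unaligned regions, which could then be compared against annotated features such as regions of difference.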
Approximately 20,730,631 short reads (59.78%) out of a total of 31,271,059 raw reads aligned to
the seven reference genomes. The per-base sequence coverage calculations revealed an average of
1,243 unaligned regions. These unaligned regions overlapped with mycobacterial regions of
difference (RD) and genetic phage elements acquired by the MTC through horizontal gene transfer,
and include genes prevalent in clinical isolates of Mycobacterium tuberculosis. A total of 2,680
genetic variations were identified and were categorised into 845 synonymous and 1,724 non-
synonymous SNPs together with 44 insertions and 67 deletions. Some of the variant alleles
overlapped genes known to be involved in TB drug resistance. While the biological significance of
our findings remains to be elucidated, they nonetheless deserve further attention, because SNPs have
the potential to affect strain phenotype through gene disruption. Therefore, any hypotheses generated
from these large-scale analyses will be tested by our collaborators at Tygerberg Medical School.
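
As a rough illustration of the variant summary and the gene-overlap step, the following Python sketch tallies the variant classes reported above and shows one simple way to find which gene intervals contain a variant position. The counts come from this abstract; the gene names, coordinates and variant positions are hypothetical placeholders, and the in-house scripts may work differently.

```python
# Variant counts as reported in the abstract.
variant_counts = {
    "synonymous_snp": 845,
    "nonsynonymous_snp": 1724,
    "insertion": 44,
    "deletion": 67,
}

total_variants = sum(variant_counts.values())
total_snps = variant_counts["synonymous_snp"] + variant_counts["nonsynonymous_snp"]
# Share of SNPs that change the encoded amino acid.
nonsyn_fraction = variant_counts["nonsynonymous_snp"] / total_snps


def genes_hit(variant_positions, genes):
    """Map each gene name to the variant positions inside its interval.

    `genes` maps a name to a 1-based inclusive (start, end) pair.
    Genes containing no variant are omitted from the result.
    """
    hits = {}
    for name, (start, end) in genes.items():
        inside = [p for p in variant_positions if start <= p <= end]
        if inside:
            hits[name] = inside
    return hits


# Hypothetical gene intervals and variant positions, for illustration only.
toy_genes = {"geneA": (100, 200), "geneB": (500, 650), "geneC": (900, 950)}
toy_variants = [150, 300, 640, 650]
toy_hits = genes_hit(toy_variants, toy_genes)
```

The four classes sum to the reported total of 2,680 variants. A linear scan is adequate at this scale, although an interval index would be preferable for much larger annotation sets.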