Bacteria Taxonomic Classification
If the second-best phylum shows ≥30% NOG matches with the NOGs of the best-matched phylum, the best phylum is selected by comparing the NOGs of the query genome with the classes present in both phyla. The class with the maximum number of NOG matches is selected as the best class, and its corresponding phylum is selected as the best phylum. If the top two classes of the selected phylum show ≥30% NOG matches, the best class is selected by comparing the NOGs of the query genome with the unique NOGs of the orders present in both classes; the order with the maximum number of matches is selected as the best order, and its corresponding class and phylum are selected as the best class and best phylum. If the order is correctly assigned, the lower taxonomic levels are assigned as per the methodology defined for a single best match (Figure ). This methodology was used to develop a computational tool, ‘Microtaxi’, which can be used to determine the taxonomy of a bacterial genome using its complete set of protein sequences as the input.
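The phylum-level tie-breaking rule described above can be sketched in Python as follows. This is a minimal illustration and not the Microtaxi implementation: the function names, the dictionary-of-sets inputs, and in particular the reading of the ≥30% criterion as "hits of the second-best phylum divided by hits of the best phylum" are assumptions made for the example.

```python
from collections import Counter

TIE_THRESHOLD = 0.30  # the ">=30%" cut-off from the text; how it is measured is an assumption


def count_matches(query_nogs, taxon_nogs):
    """Count how many query NOGs occur in each taxon's set of specific NOGs."""
    return Counter({taxon: len(query_nogs & nogs) for taxon, nogs in taxon_nogs.items()})


def pick_phylum(query_nogs, phylum_nogs, class_nogs, class_to_phylum):
    """Choose a phylum, falling back to a class-level comparison when ambiguous.

    All arguments are hypothetical data structures:
      phylum_nogs / class_nogs: dict mapping taxon name -> set of NOG ids
      class_to_phylum: dict mapping class name -> parent phylum name
    """
    ranked = count_matches(query_nogs, phylum_nogs).most_common()
    best, best_hits = ranked[0]
    if len(ranked) == 1 or best_hits == 0:
        return best
    second, second_hits = ranked[1]
    # Single clear best match: no tie-breaking needed.
    if second_hits / best_hits < TIE_THRESHOLD:
        return best
    # Ambiguous case: compare the query against the classes of both candidate phyla.
    candidates = {
        cls: nogs
        for cls, nogs in class_nogs.items()
        if class_to_phylum[cls] in (best, second)
    }
    if not candidates:
        return best
    best_class, _ = count_matches(query_nogs, candidates).most_common(1)[0]
    return class_to_phylum[best_class]
```

The same pattern could be repeated one level down (classes resolved by their orders) to mirror the second tie-breaking step in the text.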
Performance of Microtaxi
Since only a small fraction (0.13-26.41%) of the total NOGs from any bacterial genome was selected in the list of taxon-specific NOGs, all 2,406 genomes could be used as a self-test set to evaluate the prediction accuracy of Microtaxi. It predicted the correct taxonomy up to the species rank for 2,342 genomes and up to the genus rank for 2,361 genomes (Additional file ). Of the remaining 45 genomes, it predicted the correct taxonomy at the order rank for 43 genomes and at the family rank for 41 genomes.
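Rank-wise accuracy of the kind reported here can be computed along the following lines. This is a sketch under assumptions: the lineage dictionaries, the rank list, and the rule that a genome counts as correct at a rank only if all higher ranks are also correct are illustrative choices, not details stated in the text.

```python
RANKS = ("phylum", "class", "order", "family", "genus", "species")


def rank_accuracy(predicted, reference, ranks=RANKS):
    """Return {rank: percentage of genomes whose lineage is correct down to that rank}.

    predicted / reference: dict mapping genome id -> {rank: taxon name}
    (hypothetical input format; complete reference lineages are assumed).
    """
    if not reference:
        raise ValueError("reference set is empty")
    hits = {rank: 0 for rank in ranks}
    for genome, truth in reference.items():
        guess = predicted.get(genome, {})
        for rank in ranks:
            if guess.get(rank) != truth.get(rank):
                break  # a wrong rank invalidates every rank below it
            hits[rank] += 1
    return {rank: 100.0 * count / len(reference) for rank, count in hits.items()}
```

Applied to the counts above, the same arithmetic gives 100 × 2,342 / 2,406 ≈ 97.34% at the species rank and 100 × 2,361 / 2,406 ≈ 98.13% at the genus rank.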
On the first test set, consisting of 56 bacterial genomes, it showed 100% accuracy of classification at the phylum, class, order and family levels and an accuracy of 96.30% at the genus level (Additional file ). On the second test set, consisting of 36 recently published bacterial genomes, it displayed 100% accuracy of classification up to the order rank. Thirty-five of the 36 genomes were correctly classified up to the genus rank, and for the remaining genome the correct classification could be made only up to the order rank (Additional file ).
On the third test set, consisting of 17 bacterial genomes for which the complete taxonomy is not yet known, Microtaxi predicted a taxonomic classification for all the genomes (Additional file ). Comparison with the available taxonomic ranks of these genomes showed that the Microtaxi classification was correct for 16 of the 17 genomes. Because the complete taxonomy of these genomes is not known and there is no reference against which to validate the predicted classification, the results were checked using 16S rRNA sequences for the four classes (alpha, beta, gamma and delta) of the phylum Proteobacteria, which was one of the phyla represented among the 17 selected genomes. Among the four classes, the gamma_proteobacterium_HdN1 genome, belonging to the gamma proteobacteria class, was assigned as Hahella_chejuensis_KCTC_2396 by Microtaxi, which was also the only species identified in its family, Hahellaceae. Therefore, the confirmatory 16S rRNA analysis could not be performed for this class.
For the remaining three classes, 16S rRNA sequences were retrieved for all the strains of the predicted family included in the training dataset, since the predictions of Microtaxi were shown to be 100% accurate up to the family level. For Alpha proteobacterium HIM B59 and Delta proteobacterium BABL1, the maximum identities achieved on 16S rRNA alignment with other strains of their respective predicted families were only 82.1% and 77.2%, respectively. Hence, confirmatory 16S rRNA analysis was not performed for these two genomes, since 16S rRNA analysis is not reliable at such low identity. However, in the case of beta proteobacterium CB, the analysis was carried out using the 16S rRNA sequences of 35 different species of the predicted family Burkholderiaceae, including the 16S rRNA of Polynucleobacter necessarius asymbioticus QLW P1DMWA-1. The resulting phylogenetic tree placed beta proteobacterium CB and Polynucleobacter necessarius asymbioticus QLW P1DMWA-1 in the same clade, which confirms the predictions made by Microtaxi.
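The identity values quoted above (82.1% and 77.2%) are percent identities from 16S rRNA alignments. The text does not say how they were computed, so the following is only one common convention, shown as a sketch: identical columns divided by all columns in which at least one sequence has a residue. The function name and the gap-handling rule are assumptions.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two rows of a pairwise alignment.

    Columns where both sequences have a gap are ignored; a gap opposite
    a residue counts as a mismatch (an assumed convention).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap and b == gap:
            continue  # column carries no information for this pair
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0
```

Other conventions (for example, excluding every gapped column) give different values for the same alignment, so a threshold is only meaningful when the convention is fixed.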
16S rRNA-based phylogenetic tree of Beta proteobacterium CB and all the species of family Burkholderiaceae. The phylogenetic tree indicates that Beta proteobacterium CB (highlighted in red) is nearest to Polynucleobacter necessarius asymbioticus QLW P1DMWA-1 (highlighted in green). The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (1000 replicates) is shown next to the branches, and the branch lengths are shown below the branches.
Functional analysis of unique NOGs
The functional analysis was carried out by classifying all the NOGs identified in a phylum (phylum-total) into 23 COG-based functional categories. Similarly, the phylum-unique NOGs were classified into the 23 functional categories to compare the proportions of the functional categories in the phylum-total and phylum-specific NOGs. Of the 23 COG functional categories, only the ‘U’ and ‘S’ categories were significantly (p ≤ 0) overabundant (~1.4 and ~1.5 times, respectively) in the phylum-unique NOGs (Figure ). The overabundance was calculated by dividing the observed proportion of a functional category among the phylum-unique NOGs by the proportion of the same category in the phylum-total set. The ‘S’ category was overabundant in all phyla, whereas the ‘U’ category showed more than 1.2 times abundance in only 17 of the 27 phyla. The other functional categories were underrepresented in the phylum-specific NOGs compared with their phylum-total proportions.
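The overabundance ratio described above can be illustrated with a short Python sketch. The function name and the count dictionaries are hypothetical, and the calculation assumes that "proportion" means a category's share of all NOGs in the respective set; the significance test is not reproduced here.

```python
def overabundance(unique_counts, total_counts):
    """Ratio of each category's share in phylum-unique NOGs to its share in phylum-total NOGs.

    unique_counts / total_counts: dict mapping COG category letter -> number of NOGs
    (hypothetical input format). A ratio above 1 means the category is
    overrepresented among the phylum-unique NOGs.
    """
    unique_sum = sum(unique_counts.values())
    total_sum = sum(total_counts.values())
    if unique_sum == 0 or total_sum == 0:
        raise ValueError("both NOG sets must be non-empty")
    ratios = {}
    for category, total in total_counts.items():
        if total == 0:
            continue  # category absent from the phylum, ratio undefined
        unique_share = unique_counts.get(category, 0) / unique_sum
        total_share = total / total_sum
        ratios[category] = unique_share / total_share
    return ratios
```

With made-up counts, a category holding 60% of the unique NOGs but 20% of the total NOGs would get a ratio of 3, while one holding 20% of the unique NOGs and 70% of the total would get about 0.29.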