SVM classification was used by Yosef et al. to predict plasma lipid levels in baboons from single nucleotide polymorphism data. Someya et al. used SVMs to predict carbohydrate-binding proteins from amino acid sequences. The SVM is a discriminative learning method that infers, in a supervised fashion, the relationship between input features and a target variable, such as a certain phenotype, from labeled training data. The inferred function is subsequently used to predict the value of this target variable for new data points. This type of method makes no a priori assumptions about the problem domain. SVMs can be applied to datasets with millions of input features and have good generalization abilities, in that models inferred from small amounts of training data show good predictive accuracy on novel data.
The use of models that include an L1 regularization term favors solutions in which few features are required for accurate prediction. There are several reasons why sparseness is desirable: the high dimensionality of many real datasets poses great challenges for processing, many features in these datasets are usually non-informative or noisy, and a sparse classifier can lead to faster prediction. In some applications, like ours, a small set of relevant features is desirable because it allows direct interpretation of the results.

Results

We trained an ensemble of SVM classifiers to distinguish between plant biomass-degrading and non-degrading microorganisms based on either Pfam domain or CAZy gene family annotations.
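The sparsity effect of an L1 penalty on a linear SVM trained on presence/absence annotations can be illustrated with a small toy sketch. It is not the pipeline used in this study: the squared hinge loss, the proximal-gradient solver, the hyperparameters, the function names, and the 16-sample dataset are all assumptions made for illustration.

```python
def soft_threshold(value, threshold):
    """Proximal operator of the L1 norm: shrink value toward zero by threshold."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def train_l1_svm(X, y, lam=0.05, lr=0.1, epochs=2000):
    """Minimise mean squared hinge loss + lam * ||w||_1 by proximal gradient descent.

    X: list of 0/1 feature vectors, y: labels in {-1, +1}.
    The bias b is not penalised. All defaults are illustrative assumptions.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            slack = 1.0 - yi * score
            if slack > 0.0:  # only samples inside the margin contribute
                coeff = -2.0 * slack * yi / n
                for j in range(d):
                    grad_w[j] += coeff * xi[j]
                grad_b += coeff
        # Gradient step on the loss, then soft-thresholding for the L1 term.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (degrader) or -1 (non-degrader) for one feature vector."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1


# Hypothetical toy data: 8 "degraders" (+1) and 8 "non-degraders" (-1).
# Feature 0 is present only in degraders; features 1-3 are uninformative
# and occur equally often in both classes.
X, y = [], []
for i in range(8):
    background = [(i >> k) & 1 for k in range(3)]
    X.append([1] + background)
    y.append(+1)
    X.append([0] + background)
    y.append(-1)

w, b = train_l1_svm(X, y)
selected = [j for j, wj in enumerate(w) if abs(wj) > 1e-6]
print("weights:", [round(wj, 3) for wj in w], "bias:", round(b, 3))
print("selected features:", selected)
```

On this toy data the penalty should shrink the weights of the uninformative features to zero, so that only the discriminative feature remains in the model. This is the property that makes the surviving features directly interpretable.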
We used a manually curated data set of 104 microbial genome sequence samples for this purpose, which included 19 genomes and 3 metagenomes of lignocellulose degraders and 82 genomes of non-degraders. Fungi are known to use several enzymes for plant biomass degradation for which the corresponding genes are not found in prokaryotic genomes and vice versa, while other genes are shared by prokaryotic and eukaryotic degraders. To investigate similarities and differences detectable with our method, we included the genome of the lignocellulose-degrading fungus Postia placenta in our analysis. After training, we identified the most distinctive protein domains and CAZy families of plant biomass degraders from the resulting models. We compared these protein domains and gene families with known plant biomass degradation genes.
We furthermore applied our method to identify plant biomass degraders among 15 draft genomes from the metagenome of a microbial community adherent to switchgrass in the cow rumen.

Distinctive Pfam domains of microbial plant biomass degraders

To train a classifier that distinguishes between plant biomass-degrading and non-degrading microorganisms, we used Pfam annotations of 101 microbial genomes and two metagenomes.