The Human Proteome Folding Project
Synopsis
Currently, large fractions of the proteins in all genomes have unknown function, and the vast majority have no annotation of protein structure (i.e., no homology to a sequence of known structure). Although large amounts of genome-wide data are accumulating, these systems biology data sets do not, in general, provide specific molecular functions for these proteins. To address this gap, we are developing and applying methods to extend the annotation of the Arabidopsis thaliana and, secondarily, rice genomes by incorporating protein structural information. This work has recently resulted in a collaborative grant, with collaborators including Mike Purugannan and Gloria Coruzzi, to apply our structure prediction methods to several plant genomes. These methods are based on our successful structure-based functional annotation of the yeast, human and Halobacterium genomes, among others. Structure-based methods provide an alternate route to functional annotation, yet the automatic interpretation of structure-function relationships is non-trivial and requires addressing several data integration challenges. We will address these challenges through a novel integration of best-in-class structure prediction (including Rosetta de novo structure prediction), phylogenetic and evolutionary analysis, and automated protein function prediction.
We will use the World Community Grid as our computational platform, sidestepping the computational barrier to genome-wide de novo structure prediction. Phylogenies of the domains of gene families will also be estimated, and amino acid sites under positive selection during the evolution of these gene families will be identified with codon-based models and mapped onto the predicted structures, providing a structure-based evolutionary annotation of protein domains. These results will be developed into a user-friendly database of structure-based functional annotations that will ultimately be integrated into the Arabidopsis Information Resource (TAIR). No similar resource or set of tools is currently available to the Arabidopsis community, and we expect this approach to yield large numbers of correct structure and function predictions, which would serve as a foundation for future genetic, genomic and proteomic analyses.
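As an illustration of the mapping step described above, the following is a minimal sketch, not the project's actual pipeline. It assumes that a codon-based analysis reports positively selected sites as 1-based columns of a protein alignment, and it converts those columns into 1-based residue positions of one ungapped sequence, which is the numbering a predicted structure for that sequence would typically use. The function name, the toy sequence and the site list are all hypothetical.

```python
def map_alignment_sites_to_residues(aligned_seq, selected_columns, gap_chars="-."):
    """Map 1-based alignment columns to 1-based residue positions.

    aligned_seq      -- one row of a protein alignment, gaps included (hypothetical input)
    selected_columns -- alignment columns flagged as under positive selection
    Columns that fall on a gap in this sequence have no residue and are skipped.
    """
    column_to_residue = {}
    residue_index = 0
    for column, char in enumerate(aligned_seq, start=1):
        if char in gap_chars:
            continue  # a gap does not correspond to any residue in this sequence
        residue_index += 1
        column_to_residue[column] = residue_index
    return {
        column: column_to_residue[column]
        for column in selected_columns
        if column in column_to_residue
    }


# Toy example: column 3 is a gap in this sequence, so it is dropped.
aligned = "MK-TAY--IAK"
sites = [2, 3, 6, 9]
print(map_alignment_sites_to_residues(aligned, sites))
```

The returned dictionary could then be used to highlight the corresponding residues in a predicted structure, provided the structure is numbered from the first residue of the same sequence.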