The goal of our project is to define a complete set of evolutionary histories (cascades of phylogenetic events) for the human proteome and to analyse them at the genome scale.
The genetic information encoded in the genome sequence contains the blueprint for the potential development and activity of an organism. This information can only be fully comprehended in the light of the evolutionary events (duplication, loss, recombination, mutation…) acting on the genome, which are reflected in changes in the sequence, structure and function of the gene products (nucleic acids and proteins) and, ultimately, in the biological complexity of the organism.
The recent availability of the complete genome sequences of a large number of model organisms means that we can now begin to understand the mechanisms involved in the evolution of the genome and their consequences for the study of biological systems. This is illustrated by the important role that evolutionary analyses and phylogenetic inferences play in most functional genomics studies, e.g. of promoters (‘phylogenetic footprinting’), of interactomes (notion of ‘interologs’ based on the presence and degree of conservation of counterparts of interacting proteins), and in comparisons of transcriptomes or proteomes (notion of phylogenetic proximity and co-regulation/co-expression).
At the same time, theoretical advances in information representation and management have revolutionised the way experimental information is collected, stored and exploited. Ontologies, such as Gene Ontology (GO) or Sequence Ontology (SO), provide a formal representation of the data for automatic, high-throughput data parsing by computers. These ontologies are being exploited in the new information management systems to allow large scale data mining, pattern discovery and knowledge inference.
Unfortunately, the vast number and complexity of the events shaping eukaryotic genomes mean that a complete understanding of evolution at the genomic level is not currently feasible. At the lowest level, point mutations affect individual nucleotides. At a higher level, large chromosomal segments undergo duplication, lateral transfer, inversion, transposition, deletion and insertion. Ultimately, whole genomes are involved in processes of hybridization, polyploidization and endosymbiosis, often leading to rapid speciation.
We propose to characterise and study the evolutionary histories of the human proteome, defined as the impact on the human proteins (extensions, insertions, deletions…) of the cascade of genetic events (duplication, lateral transfer, inversion, transposition, deletion, insertion…) that occurred during the evolution of the vertebrate genomes. This ambitious objective is now achievable thanks to the emergence of formal descriptions of biological data and to recent developments in accurate phylogenetic reconstruction and genome analysis (Partner 1: Figenix platform) and in automated, reliable and exploitable protein sequence alignment (Partners 1 & 2: T-Coffee, PipeAlign, MAO, MACSIMS…). These methodologies will be combined into a multi-agent expert system for the construction of evolutionary histories. To facilitate the automatic definition of the important genetic events shaping a single protein, and of their potential causes at the genome level, a new ontology will be developed. In a subsequent step, the evolutionary histories of the complete human proteome will be reconstructed and classified into sets of proteins sharing typical evolutionary histories, and these sets will be analysed functionally. A genome-level analysis will then be carried out for a selected subset of the proteins identified in the classification and functional analysis step.
Definition of an ontology of genetic events and their consequences
The first stage of the project will be the formal specification of genetic events and evolutionary concepts in the form of an ontology, which will allow their exploitation in automatic knowledge extraction and inference systems. Ontologies are essential in biology for the integration, organization and management of heterogeneous information. They also provide a means of sharing knowledge between experts in different fields (molecular biologists, computer scientists and mathematicians). The ontology will cover the genetic events at the genomic level, such as gene duplication and loss, hybridization, horizontal gene transfer, or recombination, as well as their consequences at the protein level, in terms of domain insertions/deletions and extensions. The ontology will specify individual concepts and the relationships between them. An important aspect of the ontology development will be the specification of links to existing biological ontologies, particularly SO and the Multiple Alignment Ontology (MAO; Thompson et al., 2006). Relations will be based on the Relation Ontology (RO) wherever possible.
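As an illustration only, the idea of concepts linked by typed relations can be sketched in a few lines of Python. All term identifiers, term names and the `may_cause` relation below are hypothetical placeholders, not the content of the planned ontology; `is_a` is used in the usual ontology sense.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One ontology term. Identifiers and names used here are hypothetical."""
    term_id: str
    name: str
    # Each relation is a (relation_name, target_term_id) pair.
    relations: list = field(default_factory=list)


def build_example_ontology():
    """Return a tiny illustrative ontology as a dict keyed by term identifier."""
    terms = [
        Concept("EVT:0001", "genetic_event"),
        Concept("EVT:0002", "gene_duplication", [("is_a", "EVT:0001")]),
        Concept("EVT:0003", "protein_level_consequence"),
        Concept("EVT:0004", "domain_insertion", [("is_a", "EVT:0003")]),
        # "may_cause" is an invented relation linking a genomic event
        # to a possible consequence at the protein level.
        Concept("EVT:0005", "insertion",
                [("is_a", "EVT:0001"), ("may_cause", "EVT:0004")]),
    ]
    return {term.term_id: term for term in terms}


def related_terms(ontology, term_id, relation="is_a"):
    """Collect every term reachable from term_id by following one relation type."""
    found = []
    stack = [term_id]
    while stack:
        current = stack.pop()
        for rel, target in ontology[current].relations:
            if rel == relation and target not in found:
                found.append(target)
                stack.append(target)
    return found
```

With such a structure, asking which protein-level consequences a genomic event may have reduces to following one relation type, for example `related_terms(ontology, "EVT:0005", "may_cause")`.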
Development of an expert system for the reconstruction of the evolutionary history of a single protein
We will then develop methodologies that will allow the automatic reconstruction of the evolutionary history of a given gene. These will cover the automatic identification of homologs and the construction of a high-quality Multiple Alignment of Complete Sequences (MACS), using the MAFFT or T-Coffee algorithms. For large protein families, a clustering method (TribeMCL) will be used to divide the set of homologs into subsets containing fewer than 250 sequences, which will allow us to handle these cases efficiently. A detailed quality analysis of the multiple alignment will allow the identification of the reliable regions (RASCAL, LEON, NorMD) and the construction of an accurate phylogenetic tree (Figenix). The MACS will also be used to calculate the evolutionary rate of the gene, to determine the domain organisation (MACSIMS) and to identify family- or sub-family-specific residues (OrdAlie). The results will allow us to identify important genetic events and fixed functional features that will specify the potential evolutionary history of the protein in specific phyla. An interactive tool will also be developed to localise and display the genetic events on specific branches of the gene’s phylogenetic tree. This tool will allow in-depth analysis of specific genes, for example, to detect inconsistencies that might suggest a functional shift or to reconstruct ancestral proteins.
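The size constraint on large families can be sketched as follows. This is a minimal illustration, not the planned implementation: the clustering step is passed in as a function (`cluster_fn`, a hypothetical stand-in for a TribeMCL run), `toy_cluster_by_prefix` is a toy placeholder, and the fixed-size chunking of clusters that remain too large is our own assumption.

```python
def split_family(homologs, cluster_fn, max_size=250):
    """Split homologs into subsets that each contain fewer than max_size sequences.

    cluster_fn is a stand-in for the real clustering step: it takes a list of
    sequence identifiers and returns a list of clusters (lists of identifiers).
    """
    homologs = list(homologs)
    if len(homologs) < max_size:
        return [homologs]  # small families are handled as a single set
    subsets = []
    for cluster in cluster_fn(homologs):
        cluster = list(cluster)
        if not cluster:
            continue
        if len(cluster) < max_size:
            subsets.append(cluster)
        else:
            # Assumption for this sketch only: a cluster that is still too
            # large is cut into fixed-size chunks of at most max_size - 1.
            step = max_size - 1
            for start in range(0, len(cluster), step):
                subsets.append(cluster[start:start + step])
    return subsets


def toy_cluster_by_prefix(homologs):
    """Placeholder clustering: group identifiers by their first character."""
    groups = {}
    for seq_id in homologs:
        groups.setdefault(seq_id[0], []).append(seq_id)
    return list(groups.values())
```

Each returned subset could then be aligned and analysed independently, which is what makes the large families tractable.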
Analysis of the evolutionary histories at the human proteome-scale
The formal specifications and the methodologies developed will be used to reconstruct the evolutionary histories for the complete human proteome. For each of the approximately 35000 genes, homologs will be identified in the currently available complete vertebrate genomes. The evolutionary histories of the proteins will then be analysed and classified to define sets of typical evolutionary histories. These sets will be exploited in proteome-scale analyses, for example, to compare protein families with stable or unstable evolutionary rates, or to determine the set of proteins that have never, or frequently, experienced specific events during vertebrate evolution, such as gene duplications, domain fusions or insertions, or N-terminal extensions. We will then perform a structural/functional analysis of the protein sets corresponding to each typical history, in order to detect potential enrichment of a particular class of proteins, for example, informational proteins or proteins involved in specific biological processes. Finally, for a selected subset of the proteins identified in the analysis, the relations defined in the ontology will be exploited to map the protein-level events to the available complete vertebrate genomes. Two distinct sets of proteins will be analysed first: the proteins that have experienced a major N-terminal extension or insertion, and the proteins that exhibit a potential true ortholog loss. When data are available, these two protein sets will be studied to characterise potential correlations between genetic events in the N-terminal region and shifts in promoter or transcriptional behaviour in the vertebrate lineage, or between ortholog losses and modifications of macromolecular complexes or biological pathways.
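The classification and enrichment steps can be illustrated with a short sketch. It makes two simplifying assumptions that are ours, not part of the project specification: proteins are grouped only when they share exactly the same set of events, and enrichment is assessed with a one-sided hypergeometric test, which is one common choice among several. All function and parameter names are hypothetical.

```python
from collections import defaultdict
from math import comb


def group_by_history(events_by_protein):
    """Group proteins that share exactly the same set of events.

    Exact matching is a simplification; a real classification of 'typical'
    histories would likely tolerate partial similarity.
    """
    groups = defaultdict(list)
    for protein, events in events_by_protein.items():
        groups[frozenset(events)].append(protein)
    return dict(groups)


def enrichment_p_value(population, annotated, sample, annotated_in_sample):
    """One-sided hypergeometric test.

    Probability of drawing at least annotated_in_sample annotated proteins
    when sample proteins are drawn without replacement from a population
    that contains annotated proteins of the class of interest.
    """
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(annotated_in_sample, upper + 1)
    )
    return tail / comb(population, sample)
```

For example, if a set of 5 proteins sharing one history is drawn from a population of 10 proteins, 5 of which belong to the class of interest, and all 5 drawn proteins belong to that class, the probability under the null hypothesis is 1/252.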