ALEXSYS (ALignment EXpert SYStem): Development of a new expert system for the creation, analysis and exploitation of biological sequence alignments
The last decade has provided access to a large amount of data resulting from high throughput genomic technologies, such as transcriptomics, proteomics or interactomics. This wealth of data now means that it is possible to perform detailed studies of the complex molecular networks implicated in the essential processes of life. A critical step in these studies is the comprehension of the evolutionary processes involved (duplication, loss, recombination), since they determine the sequence, structure and function of macromolecules, and finally, the biological complexity of organisms.
As a consequence, comparative sequence analyses and phylogenetic inferences are increasingly important in biological systems studies and are indispensable in analyses of promoters, transcriptomes, proteomes and interactomes. Nevertheless, all these approaches are very sensitive to the algorithms used to compare the sequences, to reconstruct the evolutionary history of the genes, to identify important genetic events and to understand their consequences. In this context, the developments related to the construction and the effective analysis of a multiple alignment of complete sequences have, and will continue to have, a major effect on research connecting evolutionary models, adaptation or co-evolution to the comprehension of the networks in which the genes and their products play a fundamental role.
The objective of this PhD project is to develop an integrated expert system to test, evaluate and optimize all the stages of the construction and the analysis of a multiple sequence alignment. The new system will be validated within the context of existing benchmark cases and the 'International Regulome Consortium' project, whose goal is to identify and characterize the complete set of transcription factors and their 'regulome' (complex regulatory networks) within several murine stem cells. The work will rely on the developments already achieved in the laboratory related to the construction and the analysis of multiple alignments (Plewniak et al, 2003 Nucleic Acids Res 31,3829-3832; Thompson et al, 2006 BMC Bioinformatics 23,7-318).
A large number of multiple alignment programs exist today, based on very diverse algorithms. However, our recent studies have shown that none of these algorithms is able to provide a high quality multiple alignment for all possible conditions. Indeed, this work has established that the nature and the variability of the problems to be treated are extremely complex (errors in the sequences; divergent sequence lengths, modular organization, speed of evolution; presence of repeat sequences, transmembrane regions, circular permutations, etc.) and that taking into account these various levels of complexity is essential to the realization of a multiple alignment of complete sequences (MACS) which is both accurate and reliable. It is clearly necessary to understand not only the nature of the provided sequences, but also the strengths and weaknesses of the algorithms used, in order to obtain a high quality result in all alignment cases. Consequently, multiple sequence alignment methods must now evolve from a single isolated algorithm towards an expert system, based on the co-operative application of different and complementary algorithms with a judicious use of additional knowledge (genomic, structural or functional).
The expert system will incorporate diverse components, covering aspects of genomic and protein data mining, validation and integration of structural/functional data, integrated with a set of different algorithms ensuring the construction, the refinement, the analysis and the exploitation of multiple sequence alignments. These elements will need to be combined in an entirely automated platform, which will be achieved using object-oriented technologies. A suitable integration will also require the development of dependency models and standard ontologies, in order to make the transfer of information between the various modules as transparent as possible. The modular design will also facilitate the incorporation of new algorithms and will allow the future evolution of the system.
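As an illustration only, the modular, object-oriented organization described above could be sketched as follows in Python. All names here (SequenceSet, AlignmentModule, PadToLongest, Pipeline) are hypothetical and do not correspond to existing ALEXSYS code; the toy module simply pads sequences with gaps and stands in for a real alignment algorithm.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class SequenceSet:
    """Hypothetical container exchanged between modules."""
    sequences: dict                       # sequence name -> residues (may contain '-')
    annotations: dict = field(default_factory=dict)


class AlignmentModule(ABC):
    """Common interface assumed for every module (aligner, refiner, analyser)."""

    @abstractmethod
    def applies_to(self, data: SequenceSet) -> bool:
        """Decide from the data whether this module should be run."""

    @abstractmethod
    def run(self, data: SequenceSet) -> SequenceSet:
        """Return a new SequenceSet; the input is left unchanged."""


class PadToLongest(AlignmentModule):
    """Toy stand-in for a real aligner: right-pads every sequence with gaps."""

    def applies_to(self, data):
        return len(data.sequences) > 1

    def run(self, data):
        width = max(len(s) for s in data.sequences.values())
        padded = {n: s.ljust(width, "-") for n, s in data.sequences.items()}
        return SequenceSet(padded, dict(data.annotations))


class Pipeline:
    """Runs the registered modules in order, skipping those that do not apply."""

    def __init__(self):
        self._modules = []

    def register(self, module: AlignmentModule):
        self._modules.append(module)

    def run(self, data: SequenceSet) -> SequenceSet:
        for module in self._modules:
            if module.applies_to(data):
                data = module.run(data)
        return data


pipeline = Pipeline()
pipeline.register(PadToLongest())
result = pipeline.run(SequenceSet({"a": "MKV", "b": "MKVLA"}))
```

Under these assumptions, a new algorithm is added by writing one more AlignmentModule subclass and registering it, without changing the other modules.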
An important element in the development of this system will be its ability to evaluate each module, not only at the level of its efficiency and its accuracy, but also depending on the type and the complexity of the biological data provided. The effective optimization of such a software network is primarily a process of investigation, aimed at an in-depth comprehension of each module and its interactions with the various biological data types. This approach will require the incorporation of combinatorial, statistical and algorithmic concepts, with a continual biological validation of the results. This biological validation will be based on (i) 'benchmarks' already developed in the laboratory (Thompson et al, 2005 Proteins 61:127-36) and (ii) a high throughput application concerning the study of the complete set of transcription factors, in collaboration with Dr. M. Andrade (Ottawa University, Canada), in the context of the International Regulome Consortium project.
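One possible way to evaluate a module against benchmark reference alignments, broken down by data category, is sketched below in Python. It uses a sum-of-pairs score (the fraction of residue pairs aligned in the reference that are also aligned in the test alignment), a measure commonly used with alignment benchmarks; whether ALEXSYS will use this particular score is an assumption. The function names and the category labels are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    `alignment` is a list of equal-length strings with '-' for gaps.
    Each residue is identified by (sequence index, residue index).
    """
    counters = [0] * len(alignment)
    pairs = set()
    for col in range(len(alignment[0])):
        present = []
        for i, row in enumerate(alignment):
            if row[col] != "-":
                present.append((i, counters[i]))
                counters[i] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs(test, reference):
    """Fraction of reference residue pairs reproduced in the test alignment."""
    ref = aligned_pairs(reference)
    if not ref:
        return 0.0
    return len(aligned_pairs(test) & ref) / len(ref)


def score_by_category(cases):
    """Average the score per data category.

    `cases` is an iterable of (category, test, reference) triples.
    """
    scores = defaultdict(list)
    for category, test, reference in cases:
        scores[category].append(sum_of_pairs(test, reference))
    return {cat: sum(vals) / len(vals) for cat, vals in scores.items()}


reference = ["AC-GT", "ACAGT"]
cases = [
    ("hypothetical_category_1", ["AC-GT", "ACAGT"], reference),
    ("hypothetical_category_2", ["ACG-T", "ACAGT"], reference),
]
report = score_by_category(cases)
```

Reporting the score per category, and not as a single global average, is what would show that a module performs well on one type of data and poorly on another.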
In the long term, this expert system should allow the construction, the validation, the visualization and the interpretation of a high quality MACS, a fundamental tool in many fields of molecular biology and essential to the comprehension of complex biological systems.