1. Brief Introduction
Bioinformatics is an emerging interdisciplinary field that integrates biology, computer science, and information technology. Our group focuses on developing new algorithms and statistical approaches that help biologists process large-scale biological data with higher quality and efficiency.
Currently, our group has ten members: three faculty members, five PhD students, and two Master's students. Our current project concerns research and development in proteomics using information techniques such as databases, algorithms, and statistics. The work is funded by the National Fundamental Research 973 Program of China.
2. Research Directions
Our group currently has three main research directions:
Protein Identification based on database searching and tandem mass spectra;
Isotopic Patterns in Tandem Mass Spectra;
Protein/RNA 3-D Structural Classification and Prediction.
Protein Identification based on database searching and tandem mass spectra
"Proteome" is a whole set of proteins expressed by a cell, a tissue, or an organism. Proteomics is a strategic strongpoint of Functional Genomics. Currently, the typical route of technology is: sample preparation, protein separation, abundance analysis and protein identification. The technologies adopted are: two dimensional gel electrophoresis (2-DE) (separation), mass spectrometry(MS) (analysis), and database searching (identification). In the process of protein identification, it is distinctly not feasible and practical to manually identify proteins using the mass generated spectra data from mass spectrometry. Thus, it becomes the most important issue to investigate how to introduce the technologies of bioinformatics into this field which adopt the methods of machine learning, data mining, and artificial intelligence to accurately identify proteins in high-throughput using mass spectra against current protein sequence databases. The principles of protein identification based on mass spectra and database searching are explained simply as follows: The process of 2-DE spread the protein mixture around respectively in two perpendicular directions: molecular weight and isoelectric point, then one point of the separated protein mixture is digested by a certain enzyme(e.g., trypsin) to generate peptides. The peptides are ionized and separated by mass spectrometry(e.g., MALDI-TOF, ESI-Ion Trap, etc.), then the m/z ratios of the ionized peptides are measured on ion detector. All the ratios constitute a mass spectrum which can act as an ID of a peptide to search in protein or peptide databases. The above process can be called Peptide Mass Fingerprint(PMF) Mapping which adapts to the situation that protein sample is more simplex. One of m/z ratios is selected in the above process and fragmentated through collision-induced dissociation(CID) to generate fragment ions whose m/z ratios can be measured. That process is called tandem mass spectrometry(MS/MS). 
MS/MS can identify more complex protein mixtures than PMF. The basic flow of protein identification consists of spectrum preprocessing, interpretation and filtering, scoring, and evaluation of search results. Preprocessing aims at noise suppression, data reduction, and information extraction. Interpretation and filtering eliminates invalid sequences from the database using composition information, sequence tags, isoelectric point, and so on. Scoring is the main module of the system, and many mathematical methods can support it, such as the vector dot product, cross-correlation, and statistical models. Evaluation measures the confidence of the scoring results. Our software, pFind, uses a scoring function based on the kernel spectral vector dot product and achieves higher accuracy than traditional methods. In our experiments, its results are competitive with or better than those of comparable software (e.g., SEQUEST, Sonar MS/MS).
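As an illustration of the scoring step, the sketch below ranks candidate peptides by a plain normalized dot product between binned spectra. It is deliberately the simplest member of the dot-product family mentioned above and does not reproduce pFind's kernel-based scoring function. The bin width, the function names, and the representation of a spectrum as `(m/z, intensity)` pairs are assumptions made for the example.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Turn (m/z, intensity) pairs into a sparse vector indexed by m/z bin."""
    vector = defaultdict(float)
    for mz, intensity in peaks:
        vector[int(mz / bin_width)] += intensity
    return dict(vector)


def normalized_dot(a, b):
    """Cosine of the angle between two sparse spectrum vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(experimental, candidates, bin_width=1.0):
    """Score each candidate's theoretical peaks against one experimental spectrum.

    `candidates` maps a peptide label to its theoretical (m/z, intensity) list.
    Returns (label, score) pairs sorted from best to worst.
    """
    exp_vec = bin_spectrum(experimental, bin_width)
    scores = [
        (label, normalized_dot(exp_vec, bin_spectrum(peaks, bin_width)))
        for label, peaks in candidates.items()
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

A plain dot product treats every fragment peak independently, so it cannot reward runs of consecutive matching fragments, which is one motivation for more elaborate scoring functions.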
Isotopic Patterns in Tandem Mass Spectra
Mass spectral data contain many noise peaks produced by possible contaminant sources, such as chemicals used during sample preparation, keratin (a ubiquitous protein found in skin and hair), and chemicals used to visualize proteins before they are excised from the 2-D gel. Overlapping peaks are also frequently observed in spectra, i.e., two or more different ions have similar masses, or the isotopic masses of different ions are confused with one another. It is therefore difficult to recognize the monoisotopic masses of all the valid ions. In addition, the accuracy of quadrupole time-of-flight mass spectrometers is most often limited by imperfect calibration. For example, temperature shifts of 20 °C in the laboratory can be sufficient to limit the accuracy of the mass assignments. High accuracy is important because it limits the number of possible candidates in a database search or de novo sequencing process and thereby increases identification specificity.
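The link between mass accuracy and the number of candidates can be made concrete with a small sketch. It assumes the common convention of expressing mass error in parts per million (ppm). The candidate masses are invented for illustration, and the function names are hypothetical.

```python
def ppm_error(observed, theoretical):
    """Relative mass error of an observation, in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def candidates_within(observed, candidate_masses, tolerance_ppm):
    """Keep the candidate masses that lie within the tolerance window."""
    return [
        mass for mass in candidate_masses
        if abs(ppm_error(observed, mass)) <= tolerance_ppm
    ]


if __name__ == "__main__":
    # Invented candidate masses clustered around one observed mass.
    masses = [1000.000, 1000.004, 1000.050, 1000.400]
    for tolerance in (500, 100, 5):
        kept = candidates_within(1000.002, masses, tolerance)
        print(tolerance, "ppm:", len(kept), "candidates")
```

Tightening the tolerance shrinks the window around the observed mass, so fewer database entries survive as candidates.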
The elements H, C, N, O, and S each have a different stable isotope distribution in nature. Most proteins are composed of these five elements and therefore have relatively stable isotope distributions (or isotope patterns), which we can use to improve peptide identification. A valid fragment ion of a given peptide produces a group of isotopic peaks in the spectrum, whereas noise generally does not, so this property can be used to distinguish valid ions from noise peaks. In addition, we can recognize and split overlapping peaks. Since each single fragment ion has a characteristic isotope pattern, an observed isotope pattern that is valid but violates the expected pattern is considered overlapping and is subsequently split.
We have proposed a method, FFP (Fragment ion Formula Prediction), to predict the elemental composition formulas of ions from the isotope patterns present in experimental tandem spectra. We use the masses of the predicted formulas as a standard to calibrate the mass error introduced by the mass spectrometer, which improves mass accuracy.
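To show how an expected isotope pattern follows from an elemental formula, the sketch below convolves per-element isotope distributions. It is a simplified illustration, not the group's method: the abundances are approximate natural values, peaks are grouped by nominal mass offset (extra neutrons) rather than exact mass, and the function names are hypothetical.

```python
import math

# Approximate natural abundances, keyed by nominal mass offset from the
# lightest isotope (0 = lightest, 1 = one extra neutron, and so on).
ISOTOPES = {
    "H": {0: 0.999885, 1: 0.000115},
    "C": {0: 0.9893, 1: 0.0107},
    "N": {0: 0.99636, 1: 0.00364},
    "O": {0: 0.99757, 1: 0.00038, 2: 0.00205},
    "S": {0: 0.9499, 1: 0.0075, 2: 0.0425, 4: 0.0001},
}


def add_atom(pattern, element_isotopes, max_peaks):
    """Convolve the current pattern with the distribution of one more atom."""
    result = [0.0] * max_peaks
    for offset, prob in enumerate(pattern):
        for extra, abundance in element_isotopes.items():
            if offset + extra < max_peaks:
                result[offset + extra] += prob * abundance
    return result


def isotope_pattern(formula, max_peaks=5):
    """Relative abundances of the M, M+1, ... peaks for a formula.

    `formula` maps element symbols to atom counts, e.g. {"C": 2, "H": 6}.
    Peaks beyond `max_peaks` are truncated, so the sum can be below 1.
    """
    pattern = [1.0] + [0.0] * (max_peaks - 1)
    for element, count in formula.items():
        for _ in range(count):
            pattern = add_atom(pattern, ISOTOPES[element], max_peaks)
    return pattern


def pattern_similarity(observed, expected):
    """Cosine similarity between an observed and an expected peak group."""
    dot = sum(o * e for o, e in zip(observed, expected))
    norm = math.sqrt(sum(o * o for o in observed)) * math.sqrt(sum(e * e for e in expected))
    return dot / norm if norm else 0.0
```

A peak group whose similarity to every plausible expected pattern is low could be flagged as noise or as a possible overlap, although the threshold for doing so would have to be chosen empirically.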
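The calibration idea can be illustrated with a least-squares fit. The source does not describe how FFP performs calibration, so the linear model below is an assumption, as are the function names. The reference masses stand in for the masses of predicted elemental formulas.

```python
def fit_linear_calibration(observed, reference):
    """Least-squares fit of reference = a * observed + b.

    `observed` are measured masses and `reference` are the trusted masses
    (here assumed to come from predicted elemental formulas). At least two
    distinct observed masses are required.
    """
    n = len(observed)
    mean_x = sum(observed) / n
    mean_y = sum(reference) / n
    sxx = sum((x - mean_x) ** 2 for x in observed)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, reference))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def calibrate(mass, a, b):
    """Apply the fitted correction to one measured mass."""
    return a * mass + b


if __name__ == "__main__":
    # Invented example: the instrument reads every mass 100 ppm too high.
    reference = [100.0, 500.0, 1000.0]
    observed = [m * 1.0001 for m in reference]
    a, b = fit_linear_calibration(observed, reference)
    print(calibrate(750.075, a, b))
```

A linear correction only removes systematic drift, so random measurement error on individual peaks remains after calibration.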
Protein/RNA 3-D Structural Classification and Prediction
Given a protein of unknown function, searching a protein structure database (PDB) for similar structures with structure alignment methods, and predicting its properties and function from them, is of great significance in protein research. Our work addresses the global alignment of protein structures. In detail, it covers the following:
1). Protein 3D structure feature extraction.
The extracted features should be invariant to rotation, shift, and size. At the same time, the similarity measure based on those features should agree with the experience of experts.
2). Protein 3D structure feature statistics model.
Features can be extracted from all the structures in the PDB. Similarly, we can extract features from the 3D structure of a protein of unknown function and, on this basis, use a statistical model to perform fast and optimal matching to find the most similar structures in the protein database.
3). Protein 3D structure classification.
We investigate methods of aligning protein 3D structures that incorporate biologists' experience, focusing mainly on biologists' definition and understanding of protein similarity. On this basis, we mine the relations between structure and function using machine learning, and apply the biologists' experience to protein structure matching.
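One simple way to obtain features that are invariant to rotation, shift, and size is a histogram of pairwise distances scaled by the largest distance. The sketch below uses that descriptor as a stand-in. The source does not specify which features the group extracts, so this choice, the function names, and the use of one 3D point per residue are all assumptions.

```python
import math
from itertools import combinations


def distance_histogram(coords, bins=10):
    """Fraction of point pairs falling in each scaled-distance bin.

    Pairwise distances do not change under rotation or shift, and dividing
    by the largest distance removes the overall size. Requires at least two
    distinct points and Python 3.8 or later (for math.dist).
    """
    dists = [math.dist(p, q) for p, q in combinations(coords, 2)]
    longest = max(dists)
    hist = [0] * bins
    for d in dists:
        hist[min(int(d / longest * bins), bins - 1)] += 1
    return [count / len(dists) for count in hist]


def feature_distance(f, g):
    """L1 distance between two feature vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(f, g))


def most_similar(query, database, bins=10):
    """Return the name of the database structure closest to the query.

    `database` maps a structure name to its list of 3D coordinates.
    """
    query_features = distance_histogram(query, bins)
    return min(
        database,
        key=lambda name: feature_distance(
            query_features, distance_histogram(database[name], bins)
        ),
    )
```

Such a descriptor is fast to compare but discards most of the geometry, so in practice it would serve as a coarse filter before a finer alignment rather than as a complete similarity measure.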
We have built the Protein Structure Align system (PSA 1.0). It has the following functions:
1). Retrieve proteins from the PDB based on geometric similarity.
2). Initialize the matrices of the protein subspace and the weights.
3). Define or select a protein training set to optimize the feature weight matrix.
4). Retrieve the protein set from the PDB and evaluate the performance of the system.
The research area has now been extended to RNA structure.
The number of RNA structures whose coordinates are available in the Protein Data Bank (PDB) and the Nucleic Acid Database (NDB), though small compared with the number of protein structures available, is substantial and rapidly growing.
Non-coding RNA (ncRNA) molecules are those RNAs that do not encode proteins, but instead serve some other function in the cell. They play a variety of critical roles and are ubiquitous in all kingdoms of life. The function of a non-coding RNA is largely determined by the three-dimensional structure of the molecule. We are developing a database for the Structural Classification of ncRNA with three aims: to organize non-coding RNA information and make it available to the non-specialist, to discover new features of RNA structure and their relationships to sequence and function, and to enumerate and classify substructures for model building and RNA engineering.
RNA sequences are being determined faster than their three-dimensional structures can be solved. Although experimental structure determination methods provide high-resolution structural information about a subset of RNAs, computational structure prediction methods will provide valuable information for the large fraction of sequences whose structures will not be determined experimentally. As structural information is critical to our understanding of the basis of the biological properties of RNA molecules, there is a tremendous incentive to develop computational methods for obtaining such information. We are developing a system for predicting RNA three-dimensional structures, including pseudoknots, from sequences.
He Simin (Direction Leader)
Sun Ruixiang (Group Leader)
Protein Identification based on database search: Fu Yan; Li Dequan; Wang Haipeng; Zou Cui; Wang Xiaobiao
Isotopic Patterns in Tandem Mass Spectra: Zhang Jingfen; Cai Jinjin
Protein/RNA 3-D Structural Classification and Prediction: Chen Xiang; Wang Shimin