Molecular and bioinformatic techniques

Introduction

Over the past few decades, major advances in molecular biology, coupled with advances in genomic technologies, have led to explosive growth in the biological information generated by researchers. This deluge of genomic information has, in turn, created an absolute requirement for computerised databases to store, organise and index the data, and for specialised tools to view and analyse them. The different kinds of structural analysis have led to several different kinds of databases. Molecular biology has branched into several distinct but strongly related fields, which are described here.

Genomics

Genomics is the mapping of the genes of humans, animals, plants and microorganisms by large-scale DNA sequence analysis. It includes the large-scale study of the function of genes and of how the hereditary information stored in them is translated into the functioning of a cell and of the whole organism. It also covers "high-throughput" technologies such as proteomics and metabolomics, and the integration and analysis of the very large amounts of complex data they produce. Genomics is a set of technologies that has become indispensable in modern research (Nationaal Regieorgaan, 18 July 2002).
Transcriptomics

Transcriptomics is the structural analysis of the transcripts accumulated in a cell or tissue. Most research methods are set up to determine differences in expression between two samples, such as an infected potato and a healthy potato. The genes that are expressed differently in the two samples can give more information about the defence of the potato plant.
Proteomics

Proteomics can be defined as the qualitative and quantitative comparison of proteomes under different conditions to further unravel biological processes. The proteome (PROTEin complement to a genOME) is a recently introduced fundamental concept that should greatly help phenomics to unravel the biochemical and physiological mechanisms of complex multivariate diseases at the functional molecular level.
Bioinformatics

Bioinformatics is the application of computer technology to the management and analysis of biological data. The result is that computers are being used to gather, store, analyse and merge biological data. Bioinformatics is an interdisciplinary research area that is the interface between the biological and computational sciences. The ultimate goal of bioinformatics is to uncover the wealth of biological information hidden in the mass of data and obtain a clearer insight into the fundamental biology of organisms. This new knowledge could have profound impacts on fields as varied as human health, agriculture, the environment, energy and biotechnology.
Why is bioinformatics important?

The genome sequencing projects have produced large amounts of nucleotide and protein sequence data. Traditionally, molecular biology research was carried out entirely at the experimental laboratory bench, but the huge increase in the scale of data being produced in this genomic era has created a need to incorporate computers into the research process.
The integration of information learned about these key biological processes should allow us to achieve the long-term goal of the complete understanding of the biology of organisms. (http://www.ebi.ac.uk/2can/bioinformatics/bioinf_what_1.html)
Databases

At the beginning of the "genomic revolution," a bioinformatics concern was the creation and maintenance of a database to store biological information, such as nucleotide and amino acid sequences. Development of this type of database involved not only design issues, but also the development of complex interfaces whereby researchers could both access existing data as well as submit new or revised data. Ultimately, however, all of this information must be combined to form a comprehensive picture of normal cellular activities so that researchers may study how these activities are altered in different disease states. Therefore, the field of bioinformatics has evolved such that the most pressing task now involves the analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures. The actual process of analysing and interpreting data is referred to as computational biology. Important sub-disciplines within bioinformatics and computational biology include:
Biological databases

A biological database is a large, organised body of persistent data, usually associated with computerised software designed to update, query, and retrieve components of the data stored within the system. A simple database might be a single file containing many records, each of which includes the same set of information. For example, a record associated with a nucleotide sequence database typically contains information such as contact name; the input sequence with a description of the type of molecule; the scientific name of the source organism from which it was isolated; and, often, literature citations associated with the sequence. For researchers to benefit from the data stored in a database, two additional requirements must be met:
Entrez

At the NCBI site, many of the databases are linked through a unique search and retrieval system called Entrez. Entrez allows a user not only to access and retrieve specific information from a single database, but also to access integrated information from many NCBI databases. For example, the Entrez protein database is cross-linked to the Entrez taxonomy database. This allows a researcher to find taxonomic information on the protein of interest. An overview of the most important databases is given in the part Databases on this site. (http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html)

Gene mapping

Genetic map

Just as interstate maps have cities and towns that serve as landmarks, genetic maps have landmarks known as genetic markers, or "markers" for short. The term marker is used very broadly to describe any observable variation that results from an alteration, or mutation, at a single genetic locus. A marker may be used as one landmark in a map if, in most cases, that stretch of DNA is inherited from parent to child according to the standard rules of inheritance. Markers can be within genes that code for a noticeable physical characteristic such as leaf colour, or for a less noticeable trait such as a disease. The greater the distance between two linked genes, the greater the chance that two nonsister chromatids will cross over in the region between the genes, and the greater the proportion of recombinants that will be produced. Thus, by determining the frequency of recombinants, we can obtain a measure of the map distance between the genes. Today, several other genetic markers are used to detect linkage. There are several genetic markers:
Currently, the most powerful mapping technique, and one that has been used to generate many genome maps, relies on Sequence Tagged Site (STS) mapping. An STS is a short DNA sequence that is easily recognisable and occurs only once in a genome (or chromosome).
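Two ideas from the mapping discussion above can be illustrated with a few lines of Python. This is only a sketch: the function names, the toy sequence and the cross in the usage note are hypothetical, and the linear conversion of recombinant frequency to centimorgans (1 cM = 1% recombinants) is an assumption that is only reasonable for closely linked markers, since it ignores double crossovers.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def map_distance_cm(recombinants, total_offspring):
    """Approximate map distance in centimorgans from a count of recombinants.

    Assumption: 1 cM corresponds to 1% recombinant offspring.
    """
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    if not 0 <= recombinants <= total_offspring:
        raise ValueError("recombinants must lie between 0 and total_offspring")
    return 100.0 * recombinants / total_offspring


def reverse_complement(seq):
    """Return the sequence of the opposite DNA strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def count_occurrences(genome, site):
    """Count occurrences of `site` in `genome`, including overlapping ones."""
    count = 0
    start = genome.find(site)
    while start != -1:
        count += 1
        start = genome.find(site, start + 1)
    return count


def is_candidate_sts(genome, site):
    """True if `site` occurs exactly once when both strands are considered."""
    hits = count_occurrences(genome, site)
    opposite = reverse_complement(site)
    if opposite != site:  # avoid counting a palindromic site twice
        hits += count_occurrences(genome, opposite)
    return hits == 1
```

For a hypothetical cross with 120 recombinants among 1000 offspring, `map_distance_cm(120, 1000)` gives a distance of about 12 cM. In the toy sequence `"ATGCGTACGTTAGC"`, `is_candidate_sts` accepts `"GCGTA"` (one occurrence) and rejects `"CGT"` (it occurs more than once).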
ESTs as Gene Discovery Resources

ESTs are small pieces of DNA sequence (usually 200 to 500 nucleotides long) that are generated by sequencing either one or both ends of an expressed gene. The 3' ESTs serve as a common source of STSs because of their likelihood of being unique to a particular species, and they provide the additional feature of pointing directly to an expressed gene. Because ESTs represent a copy of just the interesting part of a genome, that which is expressed, they have proven themselves again and again as powerful tools in the hunt for genes involved in hereditary diseases. ESTs also have a number of practical advantages: their sequences can be generated rapidly and inexpensively, and only one sequencing experiment is needed for each cDNA generated. ESTs are powerful tools in the hunt for known genes because they greatly reduce the time required to locate a gene. Using this method, scientists have already isolated genes involved in Alzheimer's disease, colon cancer, and many other diseases.
Cytogenetic map

A cytogenetic map is the visual appearance of a chromosome when stained and examined under a microscope. Particularly important are visually distinct regions, called light and dark bands, which give each of the chromosomes a unique appearance. This feature allows a person's chromosomes to be studied in a clinical test known as a karyotype, which allows scientists to look for chromosomal alterations.
Physical map

A physical map is a collection of overlapping clones that have been arranged into a tiling path based on either fingerprinting (digestion of clones with restriction enzymes and comparison of the fragment sizes) or hybridisation. The genetic markers can help to integrate the three maps mentioned above. For Arabidopsis thaliana, TAIR's comprehensive MapViewer is an integrated graphic display of each Arabidopsis chromosome. TAIR is the internet site where all information and data about Arabidopsis are combined. MapViewer shows genetic, physical, and sequence maps in one site and allows users to search, browse, and align different maps in a region of interest. In the future, all the maps will be fully integrated into a genome map for the organism.

Genome sequencing

What is sequencing?

Sequencing is the method used to determine the order of the base pairs in a DNA fragment. This fragment can be small (for example 500 bp) or the whole genome of an organism. One of the major methods of DNA sequencing is known as chain termination sequencing, dideoxy sequencing, or Sanger sequencing after its inventor, the biochemist Frederick Sanger. The method is elegantly simple. While DNA chains are normally made up of deoxynucleotides (dNTPs), the Sanger method also uses dideoxynucleotides (ddNTPs). Dideoxynucleotides are missing the hydroxyl (OH) group at the 3' position. This position is normally where one nucleotide attaches to another to form a chain. If there is no OH group at the 3' position, no further nucleotides can be added to the chain, so chain elongation stops. In each reaction, a small fraction of one of the four nucleotides is supplied as such a stop nucleotide. This means that every time that nucleotide is added, a fraction of the strands stops growing and keeps the length it has reached at that point. By first dividing the sample over four tubes, the procedure can be carried out four times, once for each of the four bases. When these four samples are run on a gel, the sequence can be read from the smallest fragment to the largest.

Since 1986 the sequence can be read with an automated fluorescence sequencer. The automated sequencer runs on the same principle as the Sanger method (dideoxynucleotide chain termination), but here a laser constantly scans the bottom of the gel, detecting the bands as they move down the gel. Where the manual method uses radioactive labelling, automated capillary sequencing uses fluorescent tags on the ddNTPs (a different dye for each nucleotide). This makes it possible for all four reactions (ddGTP, ddATP, ddCTP and ddTTP) to be run in one lane and increases the speed of the process fourfold. The runs are fully automated nowadays, and the gels have been replaced by capillaries. This is a very efficient method and is very useful for fast, automated sequencing of large DNA fragments. More about sequencing can be found on the history page.
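The logic of reading a four-lane chain termination gel can be mimicked in a short Python sketch. It is a deliberately idealised model, not a simulation of the chemistry: it assumes that every possible termination position yields a visible band, that fragment length is the only thing separating bands, and the function names are hypothetical.

```python
def terminated_lengths(new_strand):
    """Return, for each of the four reactions, the fragment lengths at which
    a ddNTP of that base can end the growing strand.

    `new_strand` is the newly synthesised strand, written 5' to 3'.
    """
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand.upper(), start=1):
        if base not in lanes:
            raise ValueError(f"unexpected base {base!r}")
        # A ddNTP incorporated here leaves a fragment of exactly this length.
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read the four lanes together, from the shortest band to the longest."""
    bands = [(length, base) for base, lengths in lanes.items() for length in lengths]
    return "".join(base for _, base in sorted(bands))


if __name__ == "__main__":
    lanes = terminated_lengths("GATTACA")
    for base in "ACGT":
        print(base, lanes[base])   # fragment lengths expected in each lane
    print(read_gel(lanes))         # reconstructs the synthesised strand
```

Each fragment length appears in exactly one lane, which is why sorting the bands by size recovers the order of the bases.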
There are two methods of dividing the genome into smaller parts for large-scale sequencing:
The Conventional Method

Once scientists have used PCR to create many copies of a single strand of the DNA fragment, they begin to determine the position of each letter. The original method involves the following:
This is the method that was first developed. Fortunately, contemporary institutes no longer use this exact method but one that is four times faster. By using ddNTPs tagged with fluorescent dyes they no longer need four test tubes, but a single one, which contains all four fluorescent ddNTPs. They then rely on computers to detect the different colours of the terminal nucleotides of the DNA segments after gel electrophoresis to determine the letters and length of the DNA chain. More information is given in the History part of this website.
The Shotgunning Method

In the shotgunning method, a genomic library is first made by cutting a whole genome with restriction enzymes and inserting each piece into a bacterium, which is then cloned. These segments are then detected and ordered by computers.
The advantage of using smaller fragments of the larger DNA chain is that the time required to sequence the DNA is greatly shortened. Machines can therefore sequence the fragments many times over to achieve a high level of accuracy, using sequencing software that lines up the DNA by finding overlapping letter sequences in the many pieces after gel electrophoresis. However, scientists have experienced problems when 'shotgunning' DNA strings with many common and recurring sequences; researchers using the 'shotgunning' process therefore often sequence the DNA both backwards and forwards to obtain more accurate results. The following graphic shows how the small DNA fragments are realigned to assemble the full sequence. In reality, billions of overlapping DNA pieces need to be aligned for an acceptable accuracy. Fortunately, by highly automating the 'shotgunning' method of sequencing, scientists are quickly organising enough DNA pieces to return precise chains faster than competitors using more conventional techniques. (http://www.uweb.ucsb.edu/~trevorc/techseq.html)
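The idea of lining up fragments by their overlapping letter sequences can be sketched as a greedy merge in Python. This is a toy illustration under strong assumptions (error-free reads, all from the same strand, no repeats longer than the minimum overlap); it is not the algorithm used by any particular assembly program, and the function names and the `min_len` parameter are hypothetical.

```python
def overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more: leave separate contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


if __name__ == "__main__":
    print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTTG"]))
```

Repeated sequences are exactly where such a greedy merge goes wrong, because a repeat offers several equally good overlaps; this mirrors the problem with common and recurring sequences mentioned above.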
ESTs are also very useful in the mapping of a genome. The 3' ESTs serve as a common source of STSs because of their likelihood of being unique to a particular species, and they provide the additional feature of pointing directly to an expressed gene. These ESTs provide reliable genomic landmarks for genome mapping. There are two types of shotgunning:
Hierarchical shotgunning means that the genome is broken up into overlapping segments whose relative locations are known; each segment is then shotgun-sequenced. In whole genome shotgunning, the whole genome is broken into pieces several times and all pieces are sequenced. Both types detect and order the segments by computer after sequencing each segment. Whole genome shotgunning was invented by Craig Venter's TIGR, and the technique was used to sequence several genomes, such as the bacterium Haemophilus influenzae, Drosophila melanogaster and Venter's part of the human genome.
Protein analysis

What is protein analysis?

Proteins are responsible for maintaining all cellular functions and their production is governed by the genetic code. A disease may be the result of gene mutations that cause changes in the structure and activity of a protein. Therefore, characterising proteins and understanding their function is important for identifying novel drug targets and designing more effective medicines. Protein analysis is the study of the total protein content of a cell type or organism. It gives a better understanding of the function of genes and the proteins for which they code. This knowledge is of fundamental importance for the development of molecular medicine. Protein analysis is a complex process that requires multiple steps and advanced technologies. Two of the most important technologies are mass spectrometry and 2D gel electrophoresis.

Mass Spectrometry

Matrix-assisted laser desorption/ionisation-time of flight mass spectrometry (MALDI-TOF MS) is a relatively novel technique in which a co-precipitate of a UV-light absorbing matrix and a biomolecule is irradiated by a nanosecond laser pulse. Most of the laser energy is absorbed by the matrix, which prevents unwanted fragmentation of the biomolecule. The ionised biomolecules are accelerated in an electric field and enter the flight tube. During the flight in this tube, different molecules are separated according to their mass to charge ratio and reach the detector at different times. In this way each molecule yields a distinct signal. The method is used for detection and characterisation of biomolecules, such as proteins, peptides, oligosaccharides and oligonucleotides, with molecular masses between 400 and 350,000 Da. It is a very sensitive method, which allows the detection of low amounts of molecules.
Protein identification by this technique has the advantage of short measuring time (few minutes) and negligible sample consumption (less than 1 pmol) together with additional information on microheterogeneity (e.g. glycosylation) and presence of by-products. Although molecular biology has provided powerful techniques for DNA analysis, this is not yet reflected in protein analysis. Genome sequencing has yielded a wealth of information on predicted gene products, but for the majority of the expressed proteins no function is known. Proteomics is an important new field of study of protein properties (expression levels, interactions, post-translational modifications etc.) and thus can be described as functional genomics at the protein level. The mass accuracy of MALDI-TOF MS will be sufficient to characterise proteins (after tryptic digestion) from completely sequenced genomes. (http://www-micrbiol.sci.kun.nl/tech/malditof.html)
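Why ions of different mass to charge ratio reach the detector at different times can be made concrete with the idealised relation for a linear time-of-flight tube: an ion of charge z accelerated through a voltage U gains kinetic energy zeU, so its flight time over a tube of length L is t = L * sqrt(m / (2 z e U)). The sketch below assumes this textbook model only; the accelerating voltage and tube length are hypothetical values, and real instruments add corrections that are not modelled here.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19   # coulombs
DALTON = 1.66053906660e-27            # kilograms per dalton


def flight_time(mass_da, charge, accel_voltage, tube_length):
    """Idealised flight time in seconds for a linear TOF tube.

    mass_da       ion mass in daltons
    charge        number of elementary charges (z)
    accel_voltage accelerating voltage in volts (assumed value)
    tube_length   field-free drift length in metres (assumed value)
    """
    if mass_da <= 0 or charge <= 0 or accel_voltage <= 0 or tube_length <= 0:
        raise ValueError("all arguments must be positive")
    mass_kg = mass_da * DALTON
    kinetic_energy = charge * ELEMENTARY_CHARGE * accel_voltage  # z * e * U
    velocity = math.sqrt(2.0 * kinetic_energy / mass_kg)
    return tube_length / velocity


if __name__ == "__main__":
    # Hypothetical instrument settings: 20 kV acceleration, 1 m flight tube.
    for mass in (1000.0, 4000.0, 16000.0):
        microseconds = flight_time(mass, 1, 20000.0, 1.0) * 1e6
        print(f"{mass:8.0f} Da  {microseconds:8.2f} microseconds")
```

Because the flight time scales with the square root of m/z, a fourfold heavier singly charged ion takes twice as long to arrive.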
2D gel electrophoresis

By comparing the proteinaceous composition (the proteome) of microorganisms, the contribution of differentially expressed proteins to specific microbial traits can be assessed. The proteome is visualised by 2D gel electrophoresis, and proteins of interest can be identified on the genome by MALDI-TOF MS. Two-dimensional gel electrophoresis separates proteins first according to their isoelectric point and, in the second dimension, according to their molecular mass. Through hydration, proteins are absorbed by dried gel strips which carry an immobilised pH gradient. The hydrated first-dimension gel is subjected to a strong electric field. Acidic proteins at the alkaline side of the gel will dissociate and become negatively charged. Because of the electric field these proteins will migrate to the positive pole (the acidic side of the gel). Proteins will reach a point where they become neutralised by the gel, lose their net charge and do not migrate further. The same scenario applies to basic proteins. With the currently available equipment it is possible to reproducibly separate proteins which differ in only a single pH unit over 24 cm. (http://www-micrbiol.sci.kun.nl/tech/twod.html)

Gene expression analysis

Northern blotting

Northern blotting is a laboratory technique to analyse RNA expression. The experiment takes several steps. First the total RNA is isolated. Then it is separated by fragment length using gel electrophoresis. Then it is transferred to nitrocellulose or nylon filter paper. The filter can then be used to search for a particular RNA by several probing techniques, for instance radioactive labelling of the probe. The probe should be complementary to the RNA you are looking for. This simple procedure can indicate in which tissues or cell types a particular gene is expressed. In this way a Northern blot is often used for diagnostic purposes. It can also be used to confirm results from other experimental techniques such as microarrays.
Micro Array Analysis

What is microarray analysis?

Microarray technology makes use of the sequence resources created by the genome projects and other sequencing efforts to answer the question of which genes are expressed in a particular cell type of an organism, at a particular time and under particular conditions. For instance, microarrays allow comparison of gene expression between normal and diseased cells. There are several names for this technology: DNA microarrays, DNA arrays, DNA chips, gene chips, and others. Sometimes a distinction is made between these names, but in fact they are all synonyms, as there are no standard definitions for which type of microarray technology should be called by which name.

Microarray technology and applications
There are different ways in which microarrays can be used to measure gene expression levels. One of the most popular microarray applications allows the comparison of gene expression levels in two different samples, e.g. the same cell type in a healthy and a diseased state. This is called a cDNA microarray. Another array technique is the oligonucleotide array.

cDNA Microarray

In the preparation of a cDNA microarray, the total mRNA from the cells in two different conditions is extracted, and reverse transcription PCR (RT-PCR) is used to convert the RNA transcripts into cDNA. The cDNAs are usually 500 to 2000 base pairs long. The complete pool of cDNA is representative of transcriptional events in the tissue source of the RNA. The genes that were being actively transcribed in the sample will have mRNA copies that should have been first purified and then copied into cDNA during the RT-PCR step. The reverse transcription steps for the control and experimental mRNA are identical in every respect except one, and it is this difference that enables differential gene expression to be determined. Nucleotides labelled with the green fluorescent dye Cy3 are incorporated into the control cDNA, while nucleotides labelled with the red fluorescent dye Cy5 are incorporated into the experimental cDNA. After preparation, both probes are mixed and allowed to hybridise to the glass slide. Excess hybridisation buffer is washed off following an overnight incubation, and the slides are then ready to be scanned. Labelled gene products from the extracts hybridise to their complementary sequences in the spots because of preferential binding: complementary single-stranded nucleic acid sequences tend to attract each other, and the longer the complementary parts, the stronger the attraction.
Oligonucleotide Microarray

The physical chemistry of hybridisation in oligonucleotide microarrays is clearly different from that of cDNA microarrays. Oligonucleotides range in size from 10 to 25 bases, so the DNA fragments in the spots are much smaller than cDNA fragments. Oligonucleotide microarrays are used to detect point mutations (the loss, addition or change of a single base) in a known DNA sequence. Single base mismatches have much more influence on binding to an oligonucleotide than on binding to a cDNA. For example, a small genome can be synthesised on a chip as a set of thousands of 20 bp long fragments. When a single base pair mismatch exists, the fluorescence intensity decreases significantly. This technique makes it possible to find most of the point mutations in a known DNA sequence.

Data quantification
The raw data that are produced from microarray experiments are the hybridised microarray images. To obtain information about gene expression levels, these images should be analysed, each spot on the array identified, and its intensity measured and compared to the background. This is called image quantification and is done by image analysis software. To obtain the final gene expression matrix from the spot quantifications, all the quantities related to a given gene (either on the same array or on arrays measuring the same conditions in repeated experiments) have to be combined, and the entire matrix has to be scaled to make different arrays comparable. Microarrays are already producing massive amounts of data. These data, like genome sequence data, can help us to gain insights into underlying biological processes only if they are carefully recorded and stored in databases, where they can be queried, compared and analysed by different computer software programs. The EBI as well as the NCBI are establishing a public repository for microarray gene expression data analogous to banks for DNA sequence data. Microarray analysis is fundamentally a technique to identify complete gene expression profiles in selected tissues. Microarray experiments can give false positive and false negative results. Additional means of analysing gene expression (Northern blotting or RNase protection assays) must be used to check microarray conclusions. (http://www.ebi.ac.uk/2can/databases/microarray.html)
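The quantification steps described above (compare each spot to its background, combine the values belonging to one gene, scale the array) can be sketched for a two-colour experiment in Python. The specific choices here are assumptions made for illustration and are not prescribed by the text: background is subtracted with a floor of 1, replicate spots are averaged on the log2 scale, and the array is scaled by subtracting the median log ratio, which presumes that most genes are unchanged. The spot tuples and function names are hypothetical.

```python
import math
from statistics import median


def background_corrected(foreground, background, floor=1.0):
    """Spot intensity minus local background, never below `floor`."""
    return max(foreground - background, floor)


def log_ratios(spots):
    """Return median-centred log2(Cy5/Cy3) per gene.

    `spots` is a list of tuples:
    (gene, cy5_foreground, cy5_background, cy3_foreground, cy3_background)
    """
    per_gene = {}
    for gene, cy5_fg, cy5_bg, cy3_fg, cy3_bg in spots:
        red = background_corrected(cy5_fg, cy5_bg)      # experimental sample
        green = background_corrected(cy3_fg, cy3_bg)    # control sample
        per_gene.setdefault(gene, []).append(math.log2(red / green))
    # Combine replicate spots of the same gene by averaging.
    combined = {gene: sum(vals) / len(vals) for gene, vals in per_gene.items()}
    # Scale the array: centre the log ratios on their median.
    centre = median(combined.values())
    return {gene: value - centre for gene, value in combined.items()}


if __name__ == "__main__":
    example = [
        ("geneA", 1100, 100, 600, 100),
        ("geneB", 600, 100, 600, 100),
        ("geneC", 350, 100, 600, 100),
    ]
    for gene, value in log_ratios(example).items():
        print(gene, round(value, 2))
```

A positive value means more signal in the Cy5 (experimental) channel, a negative value more signal in the Cy3 (control) channel.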
Sequence alignment

What is sequence alignment?

Alignment is the result of a comparison of two or more gene or protein sequences in order to determine their degree of base or amino acid similarity. Sequence alignments are used to determine the similarity, homology, function or other degree of relatedness between two or more genes or gene products. The likelihood that the two sequences are related is represented in an alignment score. This score is calculated by totalling the scores for each matched pair of residues at each position in the alignment. The general approach for similarity searching involves the use of a set of algorithms such as the BLAST programs to compare a query sequence to all the sequences in a specified database. Comparisons are made in a pairwise fashion. Each comparison is given a score reflecting the degree of similarity between the query and the sequence being compared. The higher the score, the greater the degree of similarity. The similarity is measured and shown by aligning two sequences. Alignments can be global or local. A global alignment is an optimal alignment that includes all characters from each sequence, whereas a local alignment is an optimal alignment that includes only the most similar local region or regions. Discriminating between real and artifactual matches is done using an estimate of the probability that the match might occur by chance. Similarity by itself is not a sufficient indicator of function.
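The difference between a global and a local alignment score can be seen in a small dynamic-programming sketch in Python. It uses the classic Needleman-Wunsch (global) and Smith-Waterman (local) recurrences with a simple scoring scheme that is an assumption for illustration (+1 match, -1 mismatch, -1 per gap position). This is not how BLAST computes its scores: BLAST uses heuristics and substitution matrices that are not modelled here.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Optimal alignment score of sequences a and b.

    local=False: global alignment, every character of both sequences is used.
    local=True:  local alignment, only the best-matching region counts.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # In a global alignment, leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if local:
                cell = max(cell, 0)          # a local alignment may restart
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


if __name__ == "__main__":
    query, subject = "TTTGATTACATTT", "CCCGATTACACCC"
    print("global:", alignment_score(query, subject))
    print("local: ", alignment_score(query, subject, local=True))
```

When two sequences share only a conserved core, the local score reflects that core alone, while the global score is pulled down by the unrelated flanks.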
The Four Steps of Protein Modeling
Phylogenetic systematics is the field of biology that deals with identifying and understanding the evolutionary relationships among the many different kinds of life on earth, both living (extant) and dead (extinct).
Systematics describes the pattern of relationships among taxa and is intended to help us understand the history of all life. But history is not something we can see—it has happened once and leaves only clues as to the actual events. Scientists use these clues to build hypotheses, or models, of life's history. In phylogenetic studies, the most convenient way of visually presenting evolutionary relationships among a group of organisms is through illustrations called phylogenetic trees.
The field of molecular phylogenetics has grown, both in size and in importance, since its inception in the early 1990s, attributable mostly to advances in molecular biology and more rigorous methods for phylogenetic tree building. The importance of phylogenetics has also been greatly enhanced by the successful application of tree reconstruction, as well as other phylogenetic techniques, to more diverse and perplexing issues in biology. Phylogenies are used essentially by drawing inferences from the structure of the tree or from the way the character states map onto the tree. Broadly speaking, the relationships established by phylogenetic trees often describe a species' evolutionary history and, hence, its phylogeny: the historical relationships among lineages or organisms or their parts, such as their genes.
Molecular phylogenetics attempts to determine the rates and patterns of change occurring in DNA and proteins and to reconstruct the evolutionary history of genes and organisms. Two general approaches may be taken to obtain this information. In the first approach, scientists use DNA to study the evolution of an organism. In the second approach, different organisms are used to study the evolution of DNA. Whatever the approach, the general goal is to infer process from pattern: the processes of organismal evolution deduced from patterns of DNA variation and processes of molecular evolution inferred from the patterns of variations in the DNA itself.
Studies of gene and protein evolution often involve the comparison of homologs, sequences that have common origins but may or may not have common activity. Sequences that share an arbitrary level of similarity, determined by alignment of matching bases, are termed homologous. These sequences are inherited from a common ancestor that possessed a similar structure, although the ancestral structure may be difficult to determine because it has been modified through descent.
A straightforward phylogenetic analysis consists of four steps:
Metabolomics involves the systematic measurement of small molecules from a range of organisms, followed by statistical analyses and other investigations of that large quantity of data.
The draft sequences for the human genome have focused attention on the tremendous effort still required to understand the function of expressed genes, and the way in which genes and the proteins they encode interact within cells and organisms. Metabolomics, or the global analysis of cellular metabolites, provides a powerful new tool for gaining insight into functional biology. Snapshots of the level of small molecules within a cell, and how those levels change under different conditions, are complementary to gene expression and proteomic studies, and are actively being applied to studies of infectious diseases, production and model organisms, as well as human cells and plants.
It requires analytical techniques such as chromatography, molecular spectroscopy and mass spectrometry, coupled with multivariate data analysis methods. The aim of metabolomics is to obtain the widest possible coverage, in terms of the type and number of compounds analysed. This is achieved by making use of several, complementary analytical methods.
For target compound analysis and metabolic profiling, main techniques are gas chromatography (GC), high performance liquid chromatography (HPLC) and nuclear magnetic resonance (NMR). These approaches rely on chromatographic separations, often coupled with well-developed calibrations for specific analytes.
In metabolic fingerprinting, samples are analysed as crude extracts without any separation step, using NMR, direct injection mass spectrometry (MS), or Fourier transform infrared (FT-IR) spectroscopy. These fingerprinting approaches are often combined with multivariate analysis, to get the most out of the data.
Developments involving gas chromatography have been responsible for the recent upsurge of interest in plant metabolomics. GC provides high-resolution compound separations, and can be used in conjunction with a flame ionisation detector (GC/FID) or a mass spectrometer (GC/MS). Both detection methods are highly sensitive and universal, able to detect almost any organic compound, regardless of its class or structure. However, most of the metabolites found in plant extracts are too involatile to be analysed directly by GC methods. The compounds have to be converted to less polar, more volatile derivatives before they are applied to the GC column. Efficient derivatisation methods are available, but relatively low sample throughput is a drawback of the GC method, particularly when there are many samples to be examined.
HPLC, with UV detection, is probably the most common method used for targeted analysis of plant materials, and for metabolic profiling of individual classes. A derivatisation step is not essential (unless needed for detection), since involatile and volatile substances may be measured equally well. Selection of compounds arises initially from the type of solvent used for extraction (as with all methods that use an extraction step), and then from the type of column and detector. For example HPLC/UV will only detect compounds with a suitable chromophore; a column selected for its ability to separate one class of compounds will not generally be useful for other types. HPLC profiling methods all rely to a great extent on comparisons with reference compounds. The full UV spectrum (measured for each peak when UV-diode array detectors are used) gives some useful information on the nature of compounds in complex profiles, but often indicates the class of the compound rather than its exact identity.
In principle, proton (1H) NMR can detect any metabolites containing hydrogen. Signals can be assigned by comparison with libraries of reference compounds, or by two-dimensional NMR. 1H NMR spectra of plant extracts are inevitably crowded not only because there is a large number of contributing compounds, but also because of the low overall chemical shift dispersion. 1H spectra are also complicated by spin-spin couplings which add to signal multiplicity, although they are an important source of structural information. In 13C NMR, the chemical shift dispersion is twenty times greater and spin-spin interactions are removed by decoupling. Despite these advantages, the low sensitivity of 13C NMR prevents its routine use with complex extracts. Sensitivity can be enhanced when seedlings are grown in the presence of 13C enriched carbon dioxide, but this is obviously only an option for laboratory based studies.
It is also possible to obtain metabolite 'mass profiles' without any chromatographic separation. Such profiles are obtained by injecting crude extracts into the electrospray ionisation source of a high-resolution mass spectrometer. This technique generates mainly protonated, deprotonated or adduct molecules, such as [M+H]+, [M+cation]+ or [M-H]- for each species present in the mixture, with little or no fragmentation. Thus a fingerprint spectrum is obtained with a single peak for each metabolite, separated from other metabolites according to (accurate) molecular mass. The fingerprint can be used as a classification tool, for example in taxonomy. Some mass analysers are capable of ultra-high resolution and permit the mass to be determined to four or five decimal places. This allows unique formulae to be assigned to peaks with masses of a few hundred or so. The coupling of high sensitivity with high resolution provides a method of determining a rough estimate of the number of metabolites present and a valuable first indication, from the formulae, of their identities. Its main weakness is the inability to separate isomers of the same molecular mass.
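How an accurate mass relates to a molecular formula, and why isomers cannot be separated this way, can be shown with a short Python calculation of the m/z values of the [M+H]+ and [M-H]- species. This sketch makes several assumptions: only C, H, N and O are supported, formulae are simple strings without brackets, the ions are singly charged, and the monoisotopic masses are rounded reference values. The function names are hypothetical.

```python
import re

# Monoisotopic masses in daltons (rounded reference values).
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.0030740,
    "O": 15.9949146,
}
PROTON_MASS = 1.00727647


def parse_formula(formula):
    """Turn a simple formula such as 'C6H12O6' into element counts."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if element not in MONOISOTOPIC_MASS:
            raise ValueError(f"element {element!r} is not supported in this sketch")
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def monoisotopic_mass(formula):
    """Mass of the neutral molecule built from the lightest isotopes."""
    counts = parse_formula(formula)
    return sum(MONOISOTOPIC_MASS[el] * n for el, n in counts.items())


def mz(formula, species="[M+H]+"):
    """m/z of a singly charged protonated or deprotonated molecule."""
    mass = monoisotopic_mass(formula)
    if species == "[M+H]+":
        return mass + PROTON_MASS
    if species == "[M-H]-":
        return mass - PROTON_MASS
    raise ValueError("species must be '[M+H]+' or '[M-H]-'")


if __name__ == "__main__":
    # Glucose and fructose are isomers: both are C6H12O6.
    print(round(mz("C6H12O6", "[M+H]+"), 4))
    print(round(mz("C6H12O6", "[M-H]-"), 4))
```

Because glucose and fructose share the formula C6H12O6, they give the same value in this calculation, which is the isomer limitation noted above.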
Plant extracts are very complex in composition and, if many samples are examined, it is difficult to make meaningful comparisons of large numbers of spectra or chromatograms 'by eye'. Multivariate statistical methods can be extremely useful, as they are able to compress data into a more easily managed form. This can assist in visualizing, for example, how a given sample relates to other samples - a central issue in metabolomics. Multivariate analysis is practically essential in the fingerprinting approaches, but is also helpful in techniques where individual metabolites are explicitly quantified (eg GC/MS).
Principal component analysis (PCA) is a well-known and effective method of data compression. PCA transforms the original data (e.g. intensity values in a spectrum) into a set of 'scores' for each sample, measured with respect to the principal component axes ('loadings'). The PC scores replace the original variates, and are: (i) ordered, with successive PCs accounting for decreasing amounts of variance, and (ii) orthogonal, with no correlation between the scores on different axes. Due to these properties, a small number of PCs can replace the many original variates without much loss of information.
Scatter plots of the scores on the first few PC loadings provide an excellent means of visualizing and summarising the data and often reveal patterns that cannot be discerned in the original data. The scores plots may show clustering of similar samples, separation of different sample types, or the presence of outliers. Plots of the loadings themselves may be used to explore which compounds are most responsible for, say, separating samples into groups: the most important compounds (peaks) tend to correspond to high absolute loading values.
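The relation between loadings and scores can be illustrated with a dependency-free Python sketch that finds only the first principal component. It is a simplification under stated assumptions: the data are mean-centred but not otherwise scaled, the first axis is approximated by power iteration on the covariance matrix (a swapped-in technique chosen to avoid external libraries; statistical packages normally use a full eigen or singular value decomposition), and the function names are hypothetical.

```python
import math


def centre_columns(rows):
    """Subtract each variable's mean so that every column has mean zero."""
    n = len(rows)
    means = [sum(column) / n for column in zip(*rows)]
    return [[value - mean for value, mean in zip(row, means)] for row in rows]


def covariance_matrix(centred):
    """Sample covariance matrix of data that are already centred."""
    n, p = len(centred), len(centred[0])
    return [
        [sum(row[i] * row[j] for row in centred) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def first_loading(cov, iterations=500):
    """Approximate the first principal axis by power iteration."""
    p = len(cov)
    # Start from the unit vector of the variable with the largest variance.
    start = max(range(p), key=lambda i: cov[i][i])
    vector = [1.0 if i == start else 0.0 for i in range(p)]
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break  # no variance at all: there is no direction to find
        vector = [x / norm for x in product]
    return vector


def pc1(rows):
    """Return (loading, scores) for the first principal component.

    `rows` holds one sample per row and one variable (e.g. peak) per column.
    """
    centred = centre_columns(rows)
    loading = first_loading(covariance_matrix(centred))
    scores = [sum(v * w for v, w in zip(row, loading)) for row in centred]
    return loading, scores


if __name__ == "__main__":
    # Hypothetical intensities of three peaks in four samples.
    samples = [[2.0, 4.1, 1.0], [3.0, 6.2, 1.1], [4.0, 7.9, 0.9], [5.0, 10.1, 1.0]]
    loading, scores = pc1(samples)
    print([round(x, 3) for x in loading])
    print([round(x, 3) for x in scores])
```

In this toy data set the first two peaks rise together while the third barely changes, so the third peak would be expected to receive a loading close to zero.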
(http://www.metabolomics-nrp.org.uk/techniques.html)
Under construction.
Last Modified 15 July 2003 by SK