Molecular and bioinformatic techniques

Contents

Introduction

Gene mapping

Genome sequencing

Protein analysis

Gene expression analysis

Micro Array Analysis

Sequence alignment

Protein modeling

Phylogenetics

Determination of metabolites

Clustering

Introduction

Over the past few decades, major advances in molecular biology, coupled with advances in genomic technologies, have led to explosive growth in the biological information generated by researchers. This deluge of genomic information has, in turn, led to an absolute requirement for computerised databases to store, organise and index the data, and for specialised tools to view and analyse the data. The different kinds of structural analysis have led to several different kinds of databases. Molecular biology has branched into several distinct but closely related fields, which are described here.

Genomics

Genomics is the mapping of the genes of humans, animals, plants and microorganisms by large-scale DNA sequence analysis. It includes large-scale research into the function of genes and into how the hereditary information stored in these genes is translated into the functioning of a cell and of the whole organism. It also includes "high-throughput" technologies such as proteomics and metabolomics, and the integration and analysis of the very large amounts of complex data they produce. Genomics is a set of technologies that has become indispensable in modern research (Nationaal Regieorgaan, 18 July 2002).

 

Transcriptomics

Transcriptomics is the structural analysis of the transcripts accumulated in a cell or tissue. Most research methods are set up to determine differences in expression between two cell cultures, such as an infected potato and a healthy potato. The genes that are expressed differently in the two cultures can give more information about the defence of the potato plant.

 

Proteomics

Proteomics can be defined as the qualitative and quantitative comparison of proteomes under different conditions to further unravel biological processes. A new fundamental concept called proteome (PROTEin complement to a genOME) has recently emerged that should drastically help phenomics to unravel biochemical and physiological mechanisms of complex multivariate diseases at the functional molecular level.

 

Bioinformatics

Bioinformatics is the application of computer technology to the management and analysis of biological data. The result is that computers are being used to gather, store, analyse and merge biological data.

Bioinformatics is an interdisciplinary research area that is the interface between the biological and computational sciences. The ultimate goal of bioinformatics is to uncover the wealth of biological information hidden in the mass of data and obtain a clearer insight into the fundamental biology of organisms. This new knowledge could have profound impacts on fields as varied as human health, agriculture, the environment, energy and biotechnology.

 

Why is bioinformatics important?

The genome sequencing projects have produced large amounts of nucleotide and protein sequence data. Traditionally, molecular biology research was carried out entirely at the experimental laboratory bench, but the huge increase in the scale of data being produced in this genomic era has created a need to incorporate computers into the research process.


Sequence generation, and its subsequent storage, interpretation and analysis, are entirely computer-dependent tasks. However, the molecular biology of an organism is a very complex issue, with research being carried out at different levels including the genome, proteome, transcriptome and metabolome levels. Following on from the explosion in the volume of genomic data, similar increases in data have been observed in the fields of proteomics, transcriptomics and metabolomics.


The first challenge facing the bioinformatics community today is the intelligent and efficient storage of this mass of data. It is important to provide easy and reliable access to this data. The data itself is meaningless before analysis and the sheer volume present makes it impossible for even a trained biologist to begin to interpret it manually. Therefore, incisive computer tools must be developed to allow the extraction of meaningful biological information.


There are three central biological processes around which bioinformatics tools must be developed:

  • DNA sequence determines protein sequence
  • Protein sequence determines protein structure
  • Protein structure determines protein function

The integration of information learned about these key biological processes should allow us to achieve the long-term goal of the complete understanding of the biology of organisms.

(http://www.ebi.ac.uk/2can/bioinformatics/bioinf_what_1.html)

 

Databases

At the beginning of the "genomic revolution," a bioinformatics concern was the creation and maintenance of a database to store biological information, such as nucleotide and amino acid sequences. Development of this type of database involved not only design issues, but also the development of complex interfaces whereby researchers could both access existing data as well as submit new or revised data.

Ultimately, however, all of this information must be combined to form a comprehensive picture of normal cellular activities so that researchers may study how these activities are altered in different disease states. Therefore, the field of bioinformatics has evolved such that the most pressing task now involves the analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein domains, and protein structures. The actual process of analysing and interpreting data is referred to as computational biology. Important sub-disciplines within bioinformatics and computational biology include:

  • The development and implementation of tools that enable efficient access to, and use and management of, various types of information;

  • The development of new algorithms (mathematical formulas) and statistics with which to assess relationships among members of large data sets, such as methods to locate a gene within a sequence, predict protein structure and/or function, and cluster protein sequences into families of related sequences.

Biological databases

A biological database is a large, organised body of persistent data, usually associated with computerised software designed to update, query, and retrieve components of the data stored within the system. A simple database might be a single file containing many records, each of which includes the same set of information. For example, a record associated with a nucleotide sequence database typically contains information such as contact name; the input sequence with a description of the type of molecule; the scientific name of the source organism from which it was isolated; and, often, literature citations associated with the sequence. For researchers to benefit from the data stored in a database, two additional requirements must be met:

  • Easy access to the information;
  • A method for extracting only that information needed to answer a specific biological question.

Entrez

At the site of the NCBI, many of the databases are linked through a unique search and retrieval system, called Entrez. Entrez allows a user to not only access and retrieve specific information from a single database, but to access integrated information from many NCBI databases. For example, the Entrez protein database is cross-linked to the Entrez taxonomy database. This allows a researcher to find taxonomic information of the protein of interest. An overview of the most important databases is given in the part Databases on this site.

(http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html).


Gene mapping

Genetic map

Just like interstate maps have cities and towns that serve as landmarks, genetic maps have landmarks known as genetic markers, or "markers" for short. The term marker is used very broadly to describe any observable variation that results from an alteration, or mutation, at a single genetic locus. A marker may be used as one landmark in a map if, in most cases, that stretch of DNA is inherited from parent to child according to the standard rules of inheritance. Markers can be within genes that code for a noticeable physical characteristic such as leaf colour, or a not so noticeable trait such as a disease. The greater the distance between two linked genes, the greater the chance that two nonsister chromatids would cross over in the region between the genes and the greater the proportion of recombinants that would be produced. Thus, by determining the frequency of recombinants, we can obtain a measure of map distance between the genes. Today, several other genetic markers are used to detect linkage. There are several genetic markers:

  • RFLPs, or restriction fragment length polymorphisms, were among the first DNA markers to be developed. RFLPs are defined by the presence or absence of a specific site, called a restriction site, for a bacterial restriction enzyme. This enzyme cuts strands of DNA wherever they contain a certain nucleotide sequence;
  • VNTRs, or variable number of tandem repeat polymorphisms, occur in non-coding regions of DNA. This type of marker is defined by the presence of a nucleotide sequence that is repeated several times. In each case, the number of times a sequence is repeated may vary;
  • Microsatellite polymorphisms are defined by a variable number of repeats of a very small number of base pairs. Often, these repeats consist of the bases cytosine and adenine. The number of repeats for a given microsatellite may differ between individuals, hence the term polymorphism: the existence of different forms within a population;
  • SNPs, or single nucleotide polymorphisms, are individual point mutations, or substitutions of a single nucleotide, that do not change the overall length of the DNA sequence in that region. SNPs occur throughout an individual's genome;
  • AFLP, or amplified fragment length polymorphism, is a DNA fingerprinting technique that detects DNA restriction fragments by means of PCR amplification.
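
The relation between recombinant frequency and map distance described above can be sketched in a few lines of Python. This is only an illustration: the function names and the offspring counts are hypothetical, and the conversion assumes the usual convention that 1% recombinants corresponds to 1 map unit (centimorgan), which holds reasonably well only for closely linked markers.

```python
def recombination_frequency(recombinants: int, total_offspring: int) -> float:
    """Fraction of offspring that carry a recombinant combination of two markers."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    if not 0 <= recombinants <= total_offspring:
        raise ValueError("recombinants must lie between 0 and total_offspring")
    return recombinants / total_offspring


def map_distance_cm(rf: float) -> float:
    """Convert a recombination frequency into map units (centimorgans).

    Assumption: 1% recombinants = 1 cM. For distant markers, double
    crossovers make this an underestimate.
    """
    return 100.0 * rf


# Hypothetical cross: 18 recombinant plants among 200 offspring.
rf = recombination_frequency(18, 200)
print(f"recombination frequency: {rf:.2f}")
print(f"approximate map distance: {map_distance_cm(rf):.1f} cM")
```

Comparing such distances for several marker pairs is what allows the markers to be placed in order along a chromosome.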

Currently, the most powerful mapping technique, and one that has been used to generate many genome maps, relies on Sequence Tagged Site (STS) mapping. An STS is a short DNA sequence that is easily recognisable and occurs only once in a genome (or chromosome).

 

ESTs as Gene Discovery Resources

ESTs are small pieces of DNA sequence (usually 200 to 500 nucleotides long) that are generated by sequencing either one or both ends of an expressed gene. The 3' ESTs serve as a common source of STSs because they are likely to be unique to a particular species, and they provide the additional feature of pointing directly to an expressed gene. Because ESTs represent a copy of just the interesting part of a genome, the part that is expressed, they have proven themselves again and again as powerful tools in the hunt for genes involved in hereditary diseases. ESTs also have practical advantages: their sequences can be generated rapidly and inexpensively, and only one sequencing experiment is needed for each cDNA generated. ESTs are powerful tools in the hunt for known genes because they greatly reduce the time required to locate a gene. Using this method, scientists have already isolated genes involved in Alzheimer's disease, colon cancer, and many other diseases.

 

Cytogenetic map

A cytogenetic map is the visual appearance of a chromosome when stained and examined under a microscope. Particularly important are visually distinct regions, called light and dark bands, which give each of the chromosomes a unique appearance. This feature allows a person's chromosomes to be studied in a clinical test known as a karyotype, which allows scientists to look for chromosomal alterations.

 

Physical map

A physical map is a collection of overlapping clones that have been arranged into a tiling path based on either fingerprinting (digestion of clones with restriction enzymes and comparison of the fragment sizes) or hybridisation.

The genetic markers can help to integrate these three maps mentioned above. For Arabidopsis thaliana, TAIR's comprehensive MapViewer is an integrated graphic display of each Arabidopsis chromosome. TAIR is the internet site where all information and data about Arabidopsis is combined. MapViewer shows genetic, physical, and sequence maps in one site and allows users to search, browse, and align different maps in a region of interest. In the future, all the maps will be fully integrated into a genome map for the organism.


Genome sequencing

What is sequencing?

Sequencing is the method used to determine the order of the base pairs in a DNA fragment. This fragment can be small (such as 500 bp) or the whole genome of an organism. One of the major methods of DNA sequencing is known as chain termination sequencing, dideoxy sequencing, or Sanger sequencing, after its inventor, the biochemist Frederick Sanger. The method is elegantly simple. While DNA chains are normally made up of deoxynucleotides (dNTPs), the Sanger method also uses dideoxynucleotides.

Dideoxynucleotides (ddNTPs) are missing a hydroxyl (OH) group at the 3' position. This position is normally where one nucleotide attaches to another to form a chain. If there is no OH group at the 3' position, additional nucleotides cannot be added to the chain, so chain elongation is interrupted. In each reaction, a small fraction of one of the four bases is supplied as this stop nucleotide. This means that every time that base is added, a fraction of the strands stop growing and keep the length they have reached at that point. By first dividing the sample over four tubes, the procedure can be carried out four times, once for each of the four bases. When the four samples are run on a gel, the sequence can be read from the smallest fragment to the largest.
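
The logic of reading a sequence from four chain-termination reactions can be illustrated with a small simulation. This is a simplified sketch, not a model of the chemistry: it assumes that every possible termination product is present in each tube, ignores the primer, and assumes the template is written in the direction in which the polymerase reads it. The function names and the example template are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def sanger_fragments(template: str) -> dict:
    """Return, for each ddNTP tube, the lengths of the terminated copies.

    The polymerase builds the strand complementary to the template; in the
    tube containing ddA, for example, copies stop wherever an A is built in.
    """
    new_strand = "".join(COMPLEMENT[base] for base in template)
    tubes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        tubes[base].append(length)
    return tubes


def read_gel(tubes: dict) -> str:
    """Read the bands of all four lanes from the shortest fragment to the longest."""
    bands = sorted(
        (length, base) for base, lengths in tubes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


tubes = sanger_fragments("TACGGT")
print(tubes)            # fragment lengths per ddNTP tube
print(read_gel(tubes))  # sequence of the newly synthesised strand
```

Each fragment length occurs in exactly one tube, which is why sorting the bands by size across the four lanes recovers the order of the bases.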

Since 1986, the sequence can be read with an automated fluorescence sequencer. The automated sequencer works on the same principle as the Sanger method (dideoxynucleotide chain termination), but here a laser constantly scans the bottom of the gel, detecting the bands as they move down the gel. Where the manual method uses radioactive labelling, automated capillary sequencing uses fluorescent tags on the ddNTPs (a different dye for each nucleotide). This makes it possible for all four reactions (ddGTP, ddATP, ddCTP and ddTTP) to be run in one lane and increases the speed of the process fourfold. Nowadays the runs are fully automated and the gels have been replaced by capillaries. This is a very efficient method, well suited to fast, automated sequencing of large DNA fragments.

More about sequencing on the history page.

 

There are two methods of dividing the genome into smaller parts for large-scale sequencing:

  • The Conventional Method
  • Shotgunning

 

The Conventional Method

Once scientists have used PCR to create many copies of a single strand of the DNA fragment, they can begin to determine the position of each letter. The original method involves the following:

  • Step 1: Place identical DNA strands into four test tubes. Each tube contains one ddNTP, which resembles one of the four nucleotides in DNA (A, T, C, G) but cannot be extended into a functioning DNA chain, together with plenty of dNTPs (free-floating normal nucleotides).
  • Step 2: Then add the polymerase enzyme and a known DNA primer, similar to the one used in PCR, to the test tubes. The primer marks the beginning of each sequenced string of DNA. In each test tube the dNTPs, which act as the letters of the DNA, bond with their complementary nucleotides, thereby copying the original strand. However, the ddNTP in each test tube also bonds with the growing DNA fragments, roughly 1 time in every 100 in which it could bond. Each time this happens the copy terminates, creating millions of DNA strings of differing lengths that all start with the same primer and end with the same ddNTP nucleotide. Which nucleotide that is follows from which of the four test tubes is being analysed, since each test tube contains only one of the ddNTPs.
  • Step 3: Then use gel electrophoresis to arrange the DNA pieces from largest to smallest, and X-ray detection to determine the length of the strings.

This is the method that was first developed. Fortunately, contemporary institutes no longer use this exact method but one that is four times faster. By using ddNTPs tagged with fluorescent dyes they no longer need four test tubes, but a single one that contains all four fluorescent ddNTPs. They then rely on computers to detect the different colours of the terminal nucleotides of the DNA segments after gel electrophoresis to determine the letters and length of the DNA chain. More information is given in the History part of this website.

 

The Shotgunning Method

To use the shotgunning method, a genomic library is first made by cutting a whole genome with restriction enzymes and inserting each piece into a bacterium, which is then cloned. These segments are then detected and ordered by computers.

  • Step 1: Break the post-PCR DNA string that is to be sequenced into small fragments.
  • Step 2: Place the new segments into a test tube filled with the polymerase enzyme and a DNA primer.
  • Step 3: Now, as in the original method, let the DNA rebuild itself with the dNTPs and the fluorescently tagged ddNTPs in the test tube until all the strings have terminated.
  • Step 4: Use gel electrophoresis to sort the fragments by size and a computer to record the lengths of the many DNA fragments. Lastly, the computer processes and realigns these fragments into the original string, thereby sequencing it.

The advantage of using smaller fragments of the larger DNA chain is that the time required to sequence the DNA is greatly shortened. Machines can therefore sequence the fragments many times to achieve a high level of accuracy, using sequencing software that lines up the DNA by finding overlapping letter sequences in the many pieces after gel electrophoresis. However, scientists have experienced problems when 'shotgunning' DNA strings with many common and recurring sequences; researchers using the 'shotgunning' process therefore often sequence the DNA both backwards and forwards to obtain more accurate results. The following graphic shows how the small DNA fragments are realigned to assemble the full sequence.

In reality, billions of overlapping DNA pieces need to be aligned for acceptable accuracy. Fortunately, by highly automating the 'shotgunning' method of sequencing, scientists can organise enough DNA pieces quickly to return precise chains faster than competitors using more conventional techniques.
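
How software can realign overlapping fragments into the original string can be sketched with a greedy overlap merge. This is a toy illustration under strong assumptions (error-free reads, no repeated sequences, a hypothetical minimum overlap of 3 letters) and is not the algorithm of any particular assembler; the function names and reads are made up for the example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    reads = [r for r in reads if not any(r != o and r in o for o in reads)]
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: the remaining pieces stay separate
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


fragments = ["GTACGTTA", "ATGCGTAC", "GTTAGC"]
print(greedy_assemble(fragments))
```

With repeated sequences the longest overlap is no longer unique, which is one way to see why recurring sequences cause the problems mentioned above.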

(http://www.uweb.ucsb.edu/~trevorc/techseq.html)

 

ESTs are also very useful in the mapping of a genome. The 3' ESTs serve as a common source of STSs because they are likely to be unique to a particular species, and they provide the additional feature of pointing directly to an expressed gene. This makes ESTs reliable genomic landmarks for genome mapping.

There are two types of shotgunning:

  • Hierarchical (clone by clone)
  • Whole genome

In hierarchical shotgunning, the genome is first broken up into overlapping segments whose relative locations are known; each segment is then shotgun-sequenced. In whole genome shotgunning, the whole genome is broken into pieces several times and all pieces are sequenced. In both types, the segments are detected and ordered by computer after sequencing. Whole genome shotgunning was pioneered by Craig Venter's TIGR, and the technique was used to sequence several genomes, such as the bacterium Haemophilus influenzae, Drosophila melanogaster and Venter's part of the human genome.

 


Protein analysis

 

What is protein analysis?

Proteins are responsible for maintaining all cellular functions and their production is governed by the genetic code. A disease may be the result of gene mutations that cause changes in the structure and activity of a protein. Therefore, characterising proteins and understanding their function is important for identifying novel drug targets and designing more effective medicines. Protein analysis is the study of the total protein content of a cell type or organism. It gives a better understanding of the function of genes and the proteins for which they code. This knowledge is of fundamental importance for the development of molecular medicine. Protein analysis is a complex process that requires multiple steps and advanced technologies. Two of the most important technologies are mass spectrometry and 2D gel electrophoresis.

Mass Spectrometry

Matrix-assisted laser desorption/ionisation-time of flight mass spectrometry (MALDI-TOF MS) is a relatively novel technique in which a co-precipitate of a UV-light-absorbing matrix and a biomolecule is irradiated by a nanosecond laser pulse. Most of the laser energy is absorbed by the matrix, which prevents unwanted fragmentation of the biomolecule. The ionised biomolecules are accelerated in an electric field and enter the flight tube. During the flight in this tube, different molecules are separated according to their mass-to-charge ratio and reach the detector at different times. In this way each molecule yields a distinct signal. The method is used for the detection and characterisation of biomolecules, such as proteins, peptides, oligosaccharides and oligonucleotides, with molecular masses between 400 and 350,000 Da. It is a very sensitive method, which allows the detection of small amounts of molecules.
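
Why ions of different mass-to-charge ratio arrive at different times can be made concrete with the idealised time-of-flight relation: an ion of mass m and charge ze accelerated through a voltage U gains kinetic energy zeU = ½mv², so its flight time over a tube of length L is t = L·sqrt(m / (2zeU)). The sketch below assumes an ideal linear instrument (no initial velocity spread, no reflectron); the 20 kV voltage and 1 m tube length are hypothetical values chosen only for illustration.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge in coulomb
DALTON = 1.66053906660e-27   # one dalton in kilogram


def flight_time(mass_da: float, charge: int, accel_voltage: float,
                tube_length: float) -> float:
    """Flight time in seconds for an idealised linear TOF tube."""
    mass_kg = mass_da * DALTON
    velocity = math.sqrt(2 * charge * E_CHARGE * accel_voltage / mass_kg)
    return tube_length / velocity


def mass_to_charge(time_s: float, accel_voltage: float,
                   tube_length: float) -> float:
    """Invert the relation: m/z (in Da per elementary charge) from a flight time."""
    return 2 * E_CHARGE * accel_voltage * (time_s / tube_length) ** 2 / DALTON


# Hypothetical instrument settings: 20 kV acceleration, 1 m flight tube.
for mass in (1_000, 10_000, 100_000):
    t = flight_time(mass, 1, 20_000, 1.0)
    print(f"{mass:>7} Da  ->  {t * 1e6:.1f} microseconds")
```

Because the flight time grows with the square root of m/z, a hundredfold heavier ion takes only ten times longer to reach the detector.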

Protein identification by this technique has the advantage of short measuring time (few minutes) and negligible sample consumption (less than 1 pmol) together with additional information on microheterogeneity (e.g. glycosylation) and presence of by-products. Although molecular biology has provided powerful techniques for DNA analysis, this is not yet reflected in protein analysis. Genome sequencing has yielded a wealth of information on predicted gene products, but for the majority of the expressed proteins no function is known. Proteomics is an important new field of study of protein properties (expression levels, interactions, post-translational modifications etc.) and thus can be described as functional genomics at the protein level. The mass accuracy of MALDI-TOF MS will be sufficient to characterise proteins (after tryptic digestion) from completely sequenced genomes.

(http://www-micrbiol.sci.kun.nl/tech/malditof.html)

 

2D Gel Electrophoresis

By comparing the protein composition (the proteome) of microorganisms, the contribution of differentially expressed proteins to specific microbial traits can be assessed. The proteome is visualised by 2D gel electrophoresis, and proteins of interest can be matched to the genome by MALDI-TOF MS. Two-dimensional gel electrophoresis separates proteins first by their isoelectric point and, in the second dimension, by their molecular mass. Through hydration, proteins are taken up by dried gel strips that carry an immobilised pH gradient. The hydrated first-dimension gel is subjected to a strong electric field. Acidic proteins at the alkaline side of the gel dissociate and become negatively charged. Because of the electric field these proteins migrate towards the positive pole (the acidic side of the gel). They reach a point where they are neutralised by the gel, lose their net charge and migrate no further. The same scenario applies, in the opposite direction, to basic proteins. With currently available equipment it is possible to reproducibly separate proteins that differ by only a single pH unit over 24 cm.
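
The point at which a protein loses its net charge, its isoelectric point, can be estimated from its amino acid sequence. The sketch below is a rough approximation: it applies the Henderson-Hasselbalch relation to each ionisable group independently, and the pKa values are approximate textbook-style figures that are assumptions here (published tables differ, and real proteins deviate because of folding and modifications). The function names and example sequences are hypothetical.

```python
# Approximate pKa values (assumed for illustration; published tables differ).
POSITIVE_PKA = {"K": 10.8, "R": 12.5, "H": 6.5}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
N_TERMINUS_PKA = 8.6
C_TERMINUS_PKA = 3.6


def net_charge(sequence: str, ph: float) -> float:
    """Estimated net charge of a peptide at a given pH."""
    charge = 1 / (1 + 10 ** (ph - N_TERMINUS_PKA))    # protonated N-terminus
    charge -= 1 / (1 + 10 ** (C_TERMINUS_PKA - ph))   # deprotonated C-terminus
    for residue in sequence:
        if residue in POSITIVE_PKA:
            charge += 1 / (1 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1 / (1 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge


def isoelectric_point(sequence: str, tolerance: float = 1e-4) -> float:
    """Find the pH of zero net charge by bisection between pH 0 and 14."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2
        if net_charge(sequence, mid) > 0:
            low = mid    # still positively charged: the pI lies at higher pH
        else:
            high = mid
    return (low + high) / 2


for peptide in ("DDEEDD", "AGAGAG", "KKRRKK"):
    print(peptide, round(isoelectric_point(peptide), 2))
```

Bisection works here because the estimated net charge only decreases as the pH rises, so there is a single pH at which it crosses zero.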

(http://www-micrbiol.sci.kun.nl/tech/twod.html)


Gene expression analysis

 

Northern blotting

Northern blotting is a laboratory technique for analysing RNA expression. The experiment takes several steps. First, the total RNA is isolated. It is then separated by fragment length using gel electrophoresis and transferred to nitrocellulose or nylon filter paper. The filter can then be used to search for a particular RNA with one of several probing techniques, for instance radioactive labelling of the probe. The probe should be complementary to the RNA you are looking for. This simple procedure can indicate in which tissues or cell types a particular gene is expressed. For this reason a Northern blot is often used for diagnostic purposes. It can also be used to confirm results from other experimental techniques, such as microarrays.

Micro Array Analysis

What is microarray analysis?

Microarray technology makes use of the sequence resources created by the genome projects and other sequencing efforts to answer the question of which genes are expressed in a particular cell type of an organism, at a particular time and under particular conditions. For instance, microarrays allow comparison of gene expression between normal and diseased cells. There are several names for this technology: DNA microarrays, DNA arrays, DNA chips, gene chips and others. Sometimes a distinction is made between these names, but in practice they are synonyms, as there are no standard definitions of which type of microarray technology should be called by which name.

Microarray technology and applications

Microarrays exploit the preferential binding of complementary single-stranded nucleic acid sequences. A microarray is typically a glass slide onto which DNA molecules are attached at fixed locations (spots). There may be tens of thousands of spots on an array, each containing a huge number of identical DNA molecules (or fragments of identical molecules), of lengths from twenty to hundreds of nucleotides. (According to quick napkin calculations by Wilhelm Ansorge and John Quackenbush in Schnookeloch in Heidelberg on 4 October 2001, the number of DNA molecules in a microarray spot is 10^7 to 10^8.) For gene expression studies, each of these molecules should ideally identify one gene or one exon in the genome. In practice, however, this is not always so simple and may not even be generally possible, because genomes contain families of similar genes.

Microarrays that contain all of the approximately 6,000 genes of the yeast genome have been available since 1997. The spots are either printed on the microarrays by a robot, or synthesised by photolithography (similar to computer chip production) or by ink-jet printing. The spot diameter is of the order of 0.1 mm and can be even smaller for some microarray types.

Microarrays can be used in different ways to measure gene expression levels. One of the most popular microarray applications allows the comparison of gene expression levels in two different samples, e.g. the same cell type in a healthy and a diseased state. This is called a cDNA microarray. Another array technique is the oligonucleotide array.

 

cDNA Microarray

In the preparation of a cDNA microarray, the total mRNA from the cells in two different conditions is extracted and reverse transcription PCR (RT-PCR) is used to convert the RNA transcripts into cDNA. The cDNAs are usually 500 to 2,000 base pairs long. The complete pool of cDNA is representative of the transcriptional events in the tissue from which the RNA was taken. The genes that were being actively transcribed in the sample have mRNA copies that are first purified and then copied into cDNA during the RT-PCR step. The reverse transcription of the control and the experimental mRNA is identical in every step except one, and it is this step that enables differential gene expression to be determined. Nucleotides labelled with the green fluorescent dye Cy3 are incorporated into the control cDNA, while nucleotides labelled with the red fluorescent dye Cy5 are incorporated into the experimental cDNA. After preparation, both probes are mixed and allowed to hybridise to the glass slide. Excess hybridisation buffer is washed off after an overnight incubation, and the slides are then ready to be scanned. Labelled gene products from the extracts hybridise to their complementary sequences in the spots because of preferential binding: complementary single-stranded nucleic acid sequences tend to attract each other, and the longer the complementary parts, the stronger the attraction.

 

Oligonucleotide Microarray

The physical chemistry of hybridisation in oligonucleotide microarrays is clearly different from that of cDNA microarrays. Oligonucleotides range in size from 10 to 25 bases, so the DNA fragments in the spots are much smaller than cDNA fragments. Oligonucleotide microarrays are used to detect point mutations (the loss, addition or change of a single base) in a known DNA sequence. A single base mismatch has much more influence on binding to an oligonucleotide than on binding to a cDNA. For example, a small genome can be synthesised on a chip as a set of thousands of 20 bp long fragments. When there is a single base-pair mismatch, the fluorescence intensity decreases significantly. This technique makes it possible to find most of the point mutations in a known DNA sequence.

 

Data quantification

The dyes enable the amount of sample bound to a spot to be measured by the level of fluorescence emitted when a laser excites it. If the RNA from the sample in condition 1 is in abundance, the spot will be green; if the RNA from the sample in condition 2 is in abundance, it will be red. If both are equal, the spot will be yellow, while if neither is present it will not fluoresce and will appear black. Thus, from the fluorescence intensities and colours of each spot, the relative expression levels of the genes in both samples can be estimated.

The raw data produced by microarray experiments are the hybridised microarray images. To obtain information about gene expression levels, these images have to be analysed: each spot on the array is identified, and its intensity is measured and compared to the background. This is called image quantification and is done by image analysis software. To obtain the final gene expression matrix from the spot quantifications, all the quantities related to a given gene (either on the same array or on arrays measuring the same conditions in repeated experiments) have to be combined, and the entire matrix has to be scaled to make different arrays comparable.
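
The step from spot intensities to comparable expression values can be sketched as follows. This is a minimal illustration, not the procedure of any particular image analysis package: it assumes each spot is described by a foreground and a local background intensity for the red (Cy5) and green (Cy3) channels, expresses the result as a log2 ratio, and uses median centring as one simple scaling choice. The function names and intensity values are hypothetical.

```python
import math
from statistics import median


def log_ratios(spots):
    """Background-corrected log2(red/green) per spot.

    spots: iterable of (cy5, cy5_background, cy3, cy3_background).
    Spots that do not rise above background in both channels give None.
    """
    ratios = []
    for cy5, cy5_bg, cy3, cy3_bg in spots:
        red, green = cy5 - cy5_bg, cy3 - cy3_bg
        if red > 0 and green > 0:
            ratios.append(math.log2(red / green))
        else:
            ratios.append(None)
    return ratios


def median_normalise(ratios):
    """Shift all log ratios so that the median spot has a ratio of zero."""
    centre = median(r for r in ratios if r is not None)
    return [None if r is None else r - centre for r in ratios]


# Hypothetical spots: (Cy5, Cy5 background, Cy3, Cy3 background)
spots = [(2100, 100, 600, 100), (1100, 100, 600, 100),
         (600, 100, 600, 100), (90, 100, 600, 100)]
raw = log_ratios(spots)
print(raw)
print(median_normalise(raw))
```

A positive value then indicates a spot that is redder than the typical spot on the array, a negative value a greener one, which makes arrays with different overall brightness easier to compare.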

Microarrays are already producing massive amounts of data. These data, like genome sequence data, can help us to gain insights into underlying biological processes only if they are carefully recorded and stored in databases, where they can be queried, compared and analysed by different computer software programs. The EBI as well as the NCBI are establishing a public repository for microarray gene expression data analogous to banks for DNA sequence data.

Microarray analysis is fundamentally a technique for identifying complete gene expression profiles in selected tissues. Microarray experiments can give false positive and false negative results. Additional means of analysing gene expression (Northern blotting or RNase protection assays) must be used to verify conclusions drawn from microarrays.

(http://www.ebi.ac.uk/2can/databases/microarray.html)

 


Sequence alignment

What is sequence alignment?

Alignment is the result of a comparison of two or more gene or protein sequences in order to determine their degree of base or amino acid similarity. Sequence alignments are used to determine the similarity, homology, function or other degree of relatedness between two or more genes or gene products. The likelihood that the two sequences are related is represented in an alignment score. This score is calculated by totaling the scores for each matched pair of residues at each position in the alignment.

The general approach to similarity searching involves the use of a set of algorithms, such as the BLAST programs, to compare a query sequence to all the sequences in a specified database. Comparisons are made in a pairwise fashion. Each comparison is given a score reflecting the degree of similarity between the query and the sequence being compared. The higher the score, the greater the degree of similarity. The similarity is measured and shown by aligning two sequences. Alignments can be global or local. A global alignment is an optimal alignment that includes all characters from each sequence, whereas a local alignment is an optimal alignment that includes only the most similar local region or regions. Real matches are distinguished from artifactual ones using an estimate of the probability that the match might occur by chance. Similarity by itself is not a sufficient indicator of function.
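
The difference between a global and a local alignment score can be illustrated with a small dynamic programming sketch. The global variant follows the Needleman-Wunsch scheme and the local variant the Smith-Waterman scheme; the scoring values (match +1, mismatch -1, gap -1) and the example sequences are assumptions chosen for illustration, not the parameters of any database search program.

```python
def alignment_score(a: str, b: str, local: bool = False,
                    match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Optimal global (default) or local alignment score of two sequences."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for gaps at the start of either sequence.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(score[i - 1][j - 1] + pair,   # align the two residues
                       score[i - 1][j] + gap,        # gap in b
                       score[i][j - 1] + gap)        # gap in a
            if local:
                cell = max(cell, 0)   # a local alignment may start afresh
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


query, subject = "TTTACGTTTT", "GGACGGG"
print("global:", alignment_score(query, subject))
print("local: ", alignment_score(query, subject, local=True))
```

The two sequences share only the short region ACG, so the local score rewards that region while the global score is pulled down by the unrelated flanks.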

 

BLAST

The BLAST programs (Basic Local Alignment Search Tool) are a set of sequence comparison algorithms introduced in 1990 that are used to search sequence databases for optimal local alignments to a query. The BLAST programs improved the overall speed of searches while retaining good sensitivity (important as databases continue to grow) by breaking the query and database sequences into fragments ("words"), and initially seeking matches between fragments. The initial search is done for a word of length "W" that scores at least "T" when compared to the query using a given substitution matrix. Word hits are then extended in either direction in an attempt to generate an alignment with a score exceeding the threshold of "S". The "T" parameter dictates the speed and sensitivity of the search.
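
The seed-and-extend idea can be sketched in a few lines. This is a deliberately simplified toy, not the BLAST implementation: seeds are exact word matches of length W (so the neighbourhood threshold "T" and the substitution matrix are left out), extension is ungapped, and the `xdrop` stopping rule, the function names and all default values are assumptions made for the illustration.

```python
def extend(query, subject, qi, sj, step, match, mismatch, xdrop):
    """Ungapped extension from (qi, sj) in direction step (+1 or -1).

    Returns (gain, length) of the best-scoring extension found.
    """
    score = best = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= sj < len(subject):
        score += match if query[qi] == subject[sj] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > xdrop:   # score fell too far: give up
            break
        qi += step
        sj += step
    return best, best_len


def seed_and_extend(query, subject, W=4, S=6, match=1, mismatch=-1, xdrop=2):
    """Toy search: exact W-letter seeds, ungapped extension, score cutoff S.

    Returns a sorted list of (score, query_start, subject_start, length).
    """
    # Index every word of length W in the query by its start position.
    words = {}
    for i in range(len(query) - W + 1):
        words.setdefault(query[i:i + W], []).append(i)

    hits = set()
    for j in range(len(subject) - W + 1):
        for i in words.get(subject[j:j + W], []):
            right, rlen = extend(query, subject, i + W, j + W, +1,
                                 match, mismatch, xdrop)
            left, llen = extend(query, subject, i - 1, j - 1, -1,
                                match, mismatch, xdrop)
            score = W * match + left + right
            if score >= S:           # keep only hits reaching the threshold S
                hits.add((score, i - llen, j - llen, llen + W + rlen))
    return sorted(hits, reverse=True)


print(seed_and_extend("ACGTACGT", "TTACGTACGTTT"))  # → [(8, 0, 2, 8)]
```

Several different seeds on the same diagonal extend to the same segment, which is why the hits are collected in a set.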

The quality of each pair-wise alignment is represented as a score and the scores are ranked. Scoring matrices are used to calculate the score of the alignment base by base (DNA) or amino acid by amino acid (protein). A unitary matrix is used for DNA pairs because each position can be given a score of +1 if it matches and a score of zero if it does not. Substitution matrices are used for amino acid alignments. These are matrices in which each possible residue substitution is given a score reflecting the probability that it is related to the corresponding residue in the query. The alignment score will be the sum of the scores for each position. Various scoring systems (e.g. PAM, BLOSUM and PSSM) for quantifying the relationships between residues have been used.
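
As a minimal sketch of position-by-position scoring, the snippet below sums per-position scores over two sequences that are already aligned and contain no gaps. The unitary scoring follows the description above; the small protein table is a made-up stand-in with illustrative values only and is not taken from PAM, BLOSUM or any published matrix. All names are hypothetical.

```python
def score_alignment(seq1, seq2, pair_score):
    """Sum the per-position scores of two gap-free aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


def unitary(a, b):
    """Unitary DNA scoring: +1 for a match, 0 for a mismatch."""
    return 1 if a == b else 0


# Illustrative values only: conservative substitutions score above
# other mismatches. These numbers are NOT from PAM or BLOSUM.
TOY_SUBSTITUTIONS = {("L", "I"): 2, ("D", "E"): 2, ("K", "R"): 2}


def toy_protein_score(a, b):
    """Toy substitution scoring for amino acids (assumed values)."""
    if a == b:
        return 4
    return TOY_SUBSTITUTIONS.get((a, b), TOY_SUBSTITUTIONS.get((b, a), -1))


print(score_alignment("GATTACA", "GACTATA", unitary))       # → 5
print(score_alignment("LKD", "IRE", toy_protein_score))     # → 6
```

With the unitary matrix the score is simply the number of identical positions, whereas a substitution matrix lets chemically similar residues contribute a positive score even when they differ.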

(http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/similarity.html)

 


Protein modeling

 

What is protein modeling?

The process of evolution has resulted in the production of DNA sequences that encode proteins with specific functions. In the absence of a protein structure that has been determined by X-ray crystallography or NMR spectroscopy, researchers can try to predict the three-dimensional structure using protein or molecular modeling. This method uses experimentally determined protein structures (templates) to predict the structure of another protein that has a similar amino acid sequence (target).

Although molecular modeling may not be as accurate at determining a protein's structure as experimental methods, it is still extremely helpful in proposing and testing various biological hypotheses. Molecular modeling also provides a starting point for researchers wishing to confirm a structure through X-ray crystallography and NMR spectroscopy. As the different genome projects are producing more sequences, and because novel protein folds and families are being determined, protein modeling will become an increasingly important tool for scientists working to understand normal and disease-related processes in living organisms.

     

The Four Steps of Protein Modeling

  • Identify the proteins with known three-dimensional structures that are related to the target sequence.
  • Align the related three-dimensional structures with the target sequence and determine those structures that will be used as templates.
  • Construct a model for the target sequence based on its alignment with the template structure(s).
  • Evaluate the model against a variety of criteria to determine if it is satisfactory.
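
The first two steps are often driven by sequence identity between the target and candidate templates. The sketch below ranks candidates by percent identity computed from pairwise alignments that are assumed to exist already (gaps written as "-"). The function names, the template identifiers and the 30% cutoff are assumptions for illustration; the source does not prescribe a threshold.

```python
def percent_identity(target_aln, template_aln):
    """Percent identity over columns where both sequences have a residue."""
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(target_aln, template_aln)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def choose_templates(alignments, cutoff=30.0):
    """Rank candidate templates and keep those at or above the cutoff.

    alignments maps a template id to (aligned_target, aligned_template).
    The 30% default is an assumed rule of thumb, not a fixed standard.
    """
    ranked = sorted(((percent_identity(t, s), name)
                     for name, (t, s) in alignments.items()), reverse=True)
    return [(name, round(pid, 1)) for pid, name in ranked if pid >= cutoff]


# Hypothetical candidates with invented sequences.
candidates = {
    "tmplA": ("MKT-AYIA", "MKTQAYLA"),
    "tmplB": ("MKTAYIA", "GSSPWQA"),
}
print(choose_templates(candidates))  # → [('tmplA', 85.7)]
```

Model construction and evaluation (steps three and four) need structural coordinates and are outside the scope of this sketch.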


Phylogenetics

What is phylogenetics?

Phylogenetic systematics is the field of biology that deals with identifying and understanding the evolutionary relationships among the many different kinds of life on earth, both living (extant) and dead (extinct).

Systematics describes the pattern of relationships among taxa and is intended to help us understand the history of all life. But history is not something we can see—it has happened once and leaves only clues as to the actual events. Scientists use these clues to build hypotheses, or models, of life's history. In phylogenetic studies, the most convenient way of visually presenting evolutionary relationships among a group of organisms is through illustrations called phylogenetic trees.

 

Molecular phylogenetics

The field of molecular phylogenetics has grown, both in size and in importance, since its inception in the early 1990s, attributable mostly to advances in molecular biology and more rigorous methods for phylogenetic tree building. The importance of phylogenetics has also been greatly enhanced by the successful application of tree reconstruction, as well as other phylogenetic techniques, to more diverse and perplexing issues in biology. Phylogenies are used essentially by drawing inferences from the structure of the tree or from the way the character states map onto the tree. Broadly speaking, the relationships established by phylogenetic trees often describe a species' evolutionary history and, hence, its phylogeny, the historical relationships among lineages or organisms or their parts, such as their genes.

Molecular phylogenetics attempts to determine the rates and patterns of change occurring in DNA and proteins and to reconstruct the evolutionary history of genes and organisms. Two general approaches may be taken to obtain this information. In the first approach, scientists use DNA to study the evolution of an organism. In the second approach, different organisms are used to study the evolution of DNA. Whatever the approach, the general goal is to infer process from pattern: the processes of organismal evolution deduced from patterns of DNA variation and processes of molecular evolution inferred from the patterns of variations in the DNA itself.

Studies of gene and protein evolution often involve the comparison of homologs, sequences that have common origins but may or may not have common activity. Sequences that share an arbitrary level of similarity, determined by alignment of matching bases, are treated as homologous. These sequences are inherited from a common ancestor that possessed similar structure, although the ancestor may be difficult to determine because it has been modified through descent.

    A straightforward phylogenetic analysis consists of four steps:

    • Alignment—building the data model and extracting a dataset.
    • Determining the substitution model—consider sequence variation.
    • Tree building.
    • Tree evaluation.
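
The first three steps can be sketched on a toy dataset. The code below starts from sequences that are assumed to be aligned already, uses the Jukes-Cantor correction as the substitution model, and builds the tree with UPGMA clustering. Both of those choices are assumptions made for the example rather than methods named in the text, the sequences are invented, and tree evaluation (for instance by resampling the alignment columns) is not shown.

```python
import math
from itertools import combinations


def jc_distance(s1, s2):
    """Jukes-Cantor distance between two aligned DNA sequences."""
    pairs = [(a, b) for a, b in zip(s1, s2) if a != "-" and b != "-"]
    p = sum(a != b for a, b in pairs) / len(pairs)   # observed difference
    if p >= 0.75:
        raise ValueError("sequences too divergent for Jukes-Cantor")
    return -0.75 * math.log(1 - 4 * p / 3)


def upgma(aligned):
    """Build a UPGMA tree (nested tuples) from {name: aligned_sequence}."""
    sizes = {name: 1 for name in aligned}            # cluster -> leaf count
    dist = {(a, b): jc_distance(aligned[a], aligned[b])
            for a, b in combinations(aligned, 2)}
    while len(sizes) > 1:
        a, b = min(dist, key=dist.get)               # closest pair of clusters
        merged = (a, b)
        new_dist = {}
        for c in sizes:
            if c in (a, b):
                continue
            dac = dist.get((a, c), dist.get((c, a)))
            dbc = dist.get((b, c), dist.get((c, b)))
            # Average distance, weighted by the number of leaves per cluster.
            new_dist[(merged, c)] = ((sizes[a] * dac + sizes[b] * dbc)
                                     / (sizes[a] + sizes[b]))
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(new_dist)
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
    return next(iter(sizes))


# Invented toy alignment, for illustration only.
toy_alignment = {
    "human": "ACGTACGTAC",
    "chimp": "ACGTACGTAA",
    "mouse": "ACTTACGAAA",
    "fly":   "TCTTGCGAAT",
}
print(upgma(toy_alignment))  # → ((('human', 'chimp'), 'mouse'), 'fly')
```

UPGMA assumes a constant rate of change along all lineages, which is why other tree-building methods are often preferred for real data.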

 


Determination of metabolites

What is metabolite determination?

Metabolomics involves the systematic measurement of small molecules from a range of organisms, followed by statistical analyses and other investigations of that large quantity of data.

The draft sequence of the human genome has focused attention on the tremendous effort still required to understand the function of expressed genes, and the way in which genes and the proteins they encode interact within cells and organisms. Metabolomics, or the global analysis of cellular metabolites, provides a powerful new tool for gaining insight into functional biology. Snapshots of the level of small molecules within a cell, and how those levels change under different conditions, are complementary to gene expression and proteomic studies, and are actively being applied to studies of infectious diseases, production and model organisms, as well as human cells and plants.

Metabolomics requires analytical techniques such as chromatography, molecular spectroscopy and mass spectrometry, coupled with multivariate data analysis methods. The aim of metabolomics is to obtain the widest possible coverage, in terms of the type and number of compounds analysed. This is achieved by making use of several complementary analytical methods.

For target compound analysis and metabolic profiling, the main techniques are gas chromatography (GC), high performance liquid chromatography (HPLC) and nuclear magnetic resonance (NMR). These approaches rely on chromatographic separations, often coupled with well-developed calibrations for specific analytes.
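
A calibration for a specific analyte is commonly a straight line relating detector response to the concentration of reference standards. The sketch below fits such a line by ordinary least squares and inverts it for an unknown sample. The linear response model, the function names and the numbers are all assumptions for illustration; the standard values are invented, not measured data.

```python
def fit_calibration(concentrations, peak_areas):
    """Least-squares line: peak_area = slope * concentration + intercept."""
    n = len(concentrations)
    mean_x = sum(concentrations) / n
    mean_y = sum(peak_areas) / n
    sxx = sum((x - mean_x) ** 2 for x in concentrations)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(concentrations, peak_areas))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def concentration_from_area(area, slope, intercept):
    """Invert the calibration line to estimate an unknown concentration."""
    return (area - intercept) / slope


# Invented standards (arbitrary units), for illustration only.
standards = [0.0, 1.0, 2.0, 4.0]
areas = [1.0, 3.0, 5.0, 9.0]
slope, intercept = fit_calibration(standards, areas)
print(concentration_from_area(7.0, slope, intercept))
```

The estimate is only meaningful inside the concentration range covered by the standards.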

In metabolic fingerprinting, samples are analysed as crude extracts without any separation step, using NMR, direct injection mass spectrometry (MS), or Fourier transform infrared (FT-IR) spectroscopy. These fingerprinting approaches are often combined with multivariate analysis, to get the most out of the data.

 

Gas Chromatography (GC)

Developments involving gas chromatography have been responsible for the recent upsurge of interest in plant metabolomics. GC provides high-resolution compound separations, and can be used in conjunction with a flame ionisation detector (GC/FID) or a mass spectrometer (GC/MS). Both detection methods are highly sensitive and universal, able to detect almost any organic compound, regardless of its class or structure. However, most of the metabolites found in plant extracts are too involatile to be analysed directly by GC methods. The compounds have to be converted to less polar, more volatile derivatives before they are applied to the GC column. Efficient derivatisation methods are available, but relatively low sample throughput is a drawback of the GC method, particularly when there are many samples to be examined.

 

High Performance Liquid Chromatography (HPLC)

HPLC, with UV detection, is probably the most common method used for targeted analysis of plant materials, and for metabolic profiling of individual classes. A derivatisation step is not essential (unless needed for detection), since involatile and volatile substances may be measured equally well. Selection of compounds arises initially from the type of solvent used for extraction (as with all methods that use an extraction step), and then from the type of column and detector. For example, HPLC/UV will only detect compounds with a suitable chromophore, and a column selected for its ability to separate one class of compounds will not generally be useful for other types. HPLC profiling methods all rely to a great extent on comparisons with reference compounds. The full UV spectrum (measured for each peak when UV-diode array detectors are used) gives some useful information on the nature of compounds in complex profiles, but often indicates the class of the compound rather than its exact identity.

 

Nuclear Magnetic Resonance (NMR)

In principle, proton (1H) NMR can detect any metabolites containing hydrogen. Signals can be assigned by comparison with libraries of reference compounds, or by two-dimensional NMR. 1H NMR spectra of plant extracts are inevitably crowded not only because there is a large number of contributing compounds, but also because of the low overall chemical shift dispersion. 1H spectra are also complicated by spin-spin couplings which add to signal multiplicity, although they are an important source of structural information. In 13C NMR, the chemical shift dispersion is twenty times greater and spin-spin interactions are removed by decoupling. Despite these advantages, the low sensitivity of 13C NMR prevents its routine use with complex extracts. Sensitivity can be enhanced when seedlings are grown in the presence of 13C enriched carbon dioxide, but this is obviously only an option for laboratory based studies.

 

Direct Injection Mass Spectrometry

It is also possible to obtain metabolite 'mass profiles' without any chromatographic separation. Such profiles are obtained by injecting crude extracts into the electrospray ionisation source of a high-resolution mass spectrometer. This technique generates mainly protonated, deprotonated or adduct molecules, such as [M+H]+, [M+cation]+ or [M-H]- for each species present in the mixture, with little or no fragmentation. Thus a fingerprint spectrum is obtained with a single peak for each metabolite, separated from other metabolites according to (accurate) molecular mass. The fingerprint can be used as a classification tool, for example in taxonomy. Some mass analysers are capable of ultra-high resolution and permit the mass to be determined to four or five decimal places. This allows unique formulae to be assigned to peaks with masses of a few hundred or so. The coupling of high sensitivity with high resolution provides a method of determining a rough estimate of the number of metabolites present and a valuable first indication, from the formulae, of their identities. Its main weakness is the inability to separate isomers of the same molecular mass.
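
The ion types mentioned above can be worked through numerically. The sketch below computes the monoisotopic mass of a neutral molecule from its elemental formula and derives the expected m/z of the singly charged [M+H]+, [M+Na]+ and [M-H]- ions. The function names are hypothetical, sodium is used as one example of a cation, and the constants are standard monoisotopic values rounded for the example.

```python
# Monoisotopic masses in daltons (rounded).
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}
PROTON = 1.00727646       # mass of H+ (hydrogen atom minus one electron)
SODIUM_ION = 22.98922     # mass of Na+ (sodium atom minus one electron)


def neutral_mass(formula):
    """Monoisotopic mass of a neutral molecule, e.g. {'C': 6, 'H': 12, 'O': 6}."""
    return sum(MONOISOTOPIC[element] * count
               for element, count in formula.items())


def adduct_mz(mass):
    """Expected m/z for common singly charged electrospray ions."""
    return {
        "[M+H]+": mass + PROTON,
        "[M+Na]+": mass + SODIUM_ION,
        "[M-H]-": mass - PROTON,
    }


def ppm_error(observed, theoretical):
    """Mass error in parts per million, used to judge a formula assignment."""
    return 1e6 * (observed - theoretical) / theoretical


glucose = {"C": 6, "H": 12, "O": 6}
fructose = {"C": 6, "H": 12, "O": 6}   # an isomer: same formula, same mass

for ion, mz in adduct_mz(neutral_mass(glucose)).items():
    print(ion, round(mz, 4))
```

Because glucose and fructose share the formula C6H12O6, they give identical values here, which illustrates why accurate mass alone cannot separate isomers.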

Multivariate Analysis

Plant extracts are very complex in composition and, if many samples are examined, it is difficult to make meaningful comparisons of large numbers of spectra or chromatograms 'by eye'. Multivariate statistical methods can be extremely useful, as they are able to compress data into a more easily managed form. This can assist in visualizing, for example, how a given sample relates to other samples - a central issue in metabolomics. Multivariate analysis is practically essential in the fingerprinting approaches, but is also helpful in techniques where individual metabolites are explicitly quantified (e.g. GC/MS).

Principal component analysis (PCA) is a well-known and effective method of data compression. PCA transforms the original data (e.g. intensity values in a spectrum) into a set of 'scores' for each sample, measured with respect to the principal component axes ('loadings'). The PC scores replace the original variates, and are: (i) ordered, with successive PCs accounting for decreasing amounts of variance, and (ii) orthogonal, with no correlation between the scores on different axes. Due to these properties, a small number of PCs can replace the many original variates without much loss of information.
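
A minimal sketch of PCA in plain Python is given below. It mean-centres the data, forms the covariance matrix, and extracts the leading components one at a time by power iteration with deflation. Power iteration is a simple stand-in chosen to keep the example self-contained; in practice a linear-algebra library routine would normally be used. The function name, the start vector and the tiny dataset are assumptions for illustration.

```python
import math


def pca(data, n_components=2, iterations=200):
    """Return (scores, loadings, variances) for rows=samples, cols=variables."""
    n, m = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(m)]
           for i in range(m)]

    loadings, variances = [], []
    for _component in range(n_components):
        v = [1.0 + j for j in range(m)]              # arbitrary start vector
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _step in range(iterations):              # power iteration
            w = [sum(cov[i][j] * v[j] for j in range(m)) for i in range(m)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        variance = sum(v[i] * cov[i][j] * v[j]
                       for i in range(m) for j in range(m))
        if variance < 1e-12:                         # no variance left
            break
        loadings.append(v)
        variances.append(variance)
        # Deflation: remove the component just found from the covariance.
        for i in range(m):
            for j in range(m):
                cov[i][j] -= variance * v[i] * v[j]

    scores = [[sum(c[j] * pc[j] for j in range(m)) for pc in loadings]
              for c in centered]
    return scores, loadings, variances


# Invented two-variable dataset, for illustration only.
samples = [[-2.0, -2.0], [2.0, 2.0], [-1.0, 1.0], [1.0, -1.0]]
scores, loadings, variances = pca(samples)
print(variances)
```

For this toy dataset the first component lies along the diagonal and carries four times the variance of the second, so the first score alone already separates the two most distant samples.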

Scatter plots of the scores on the first few PC loadings provide an excellent means of visualizing and summarising the data and often reveal patterns that cannot be discerned in the original data. The scores plots may show clustering of similar samples, separation of different sample types, or the presence of outliers. Plots of the loadings themselves may be used to explore which compounds are most responsible for, say, separating samples into groups: the most important compounds (peaks) tend to correspond to high absolute loading values.

(http://www.metabolomics-nrp.org.uk/techniques.html)

 


Clustering

 

Under construction.

To homepage Wageningen UR