A project for mining SNPs from EST data.

SNPs are increasingly the markers of choice for genetic analysis. However, the number of SNP markers currently available is limited. A large number of putative SNPs is contained in publicly available databases. This project aims to exploit this information by developing and validating new software and algorithms for detecting reliable SNPs in public databases.

Identification and characterisation of genes influencing a particular trait is another major bottleneck in genetics and genomics. Two approaches are frequently used. The first uses information from databases (sequences, expression data, literature) to identify candidate genes: genes that are likely to influence the phenotypic trait, based on similarity to known genes or on their expression profiles. The second approach involves genetic studies with molecular markers that pinpoint regions of the genome where candidate genes (or QTLs, if the traits are expressed quantitatively) are located. These mapping data need to be linked to whole-genome sequences to identify putative candidate genes. This project will combine both approaches to find the genes underlying the trait of interest more rapidly and accurately. In addition, we will validate the tools by positively identifying the responsible gene(s) in Brassica. The objectives of this project are:

1. To extend bioinformatics tools for detection of reliable SNPs based on public databases

2.  To develop tools for predicting candidate genes, aimed especially at QTL trait(s)

3.  To validate SNPs and predicted candidate genes, thereby validating the bioinformatics tools


Scientific approach and Methodology
 
New software and algorithms for detection of reliable SNPs

The strategies used to filter SNPs in the existing software (above) include controlling sequence quality, selecting alignments with four or more member sequences, and identifying SNPs in haplotypes.

In our approach the following strategies will be used to filter SNPs:

1.    Use additional available information about the data sources. Important information includes, for instance, whether the data were collected from different individuals or different genotypes, and whether the individuals are heterozygous or homozygous.

2.    Document and evaluate the quality of the sequence data. Trace files and quality files for the sequences will be very helpful here. We can set quality thresholds for the individual sequences as well as for the contigs into which the sequences are assembled.

3.    Identify true SNPs. For this, specific criteria need to be formulated that enhance the reliability of a detected SNP, for instance a minimum number of similar sequences present in a contig and limited or no variation in the sequences flanking the SNP (Useche FJ et al. 2001). The overall genetic variability of the sequence is also important information, as is the sequence identity (whether the gene is part of a gene family or is duplicated, either of which will significantly hamper SNP genotyping).

4.    Design of the pipeline and interface. The phred, phrap and consed system has a command-line / text-file interface. It gives SNP information for only one contig at a time, but is able to give quality information about SNPs. In addition, it is better suited to discovering substitutions than insertion/deletion mutations (Leslie Picoult-Newberg et al. 1999). In contrast, a system like autoSNP has a web interface. AutoSNP can show SNP information across multiple contigs, but does not give quality information about SNPs. We will design a system that combines the strengths of the different existing software packages.
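
As an illustration only, the filtering criteria in strategies 2 and 3 could be sketched as follows. This is a minimal sketch, not the project's software: the Read class, the function names and all threshold values (min_depth, min_qual, min_allele_reads, flank) are assumptions chosen for the example. It assumes that the reads of one contig are already aligned to equal length, with '-' marking gaps and one phred-style quality value per column.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Read:
    """One aligned member sequence of a contig (hypothetical structure)."""
    bases: str        # aligned bases, '-' for a gap
    quals: List[int]  # phred-style quality value per aligned column


def column_alleles(reads, col, min_qual):
    """Count the bases in one column, skipping gaps and low-quality calls."""
    counts = Counter()
    for read in reads:
        base = read.bases[col]
        if base != "-" and read.quals[col] >= min_qual:
            counts[base] += 1
    return counts


def is_variable(reads, col, min_qual, min_allele_reads):
    """A column is variable if at least two alleles are each seen often enough."""
    counts = column_alleles(reads, col, min_qual)
    supported = [b for b, n in counts.items() if n >= min_allele_reads]
    return len(supported) >= 2


def candidate_snps(reads, min_depth=4, min_qual=20, min_allele_reads=2, flank=2):
    """Return the column indices that pass all filters (thresholds are assumed)."""
    # Criterion: a minimum number of member sequences in the contig.
    if len(reads) < min_depth:
        return []
    length = len(reads[0].bases)
    variable = [is_variable(reads, c, min_qual, min_allele_reads)
                for c in range(length)]
    snps = []
    for col in range(length):
        if not variable[col]:
            continue
        # Criterion: no other variable column within `flank` positions.
        lo, hi = max(0, col - flank), min(length, col + flank + 1)
        if any(variable[c] for c in range(lo, hi) if c != col):
            continue
        snps.append(col)
    return snps


if __name__ == "__main__":
    seqs = ["ACGTACGT", "ACGTACGT", "ACGAACGT", "ACGAACGT"]
    contig = [Read(s, [30] * len(s)) for s in seqs]
    print(candidate_snps(contig))
```

Requiring each allele to be supported by more than one read is one simple way to separate sequencing errors from real polymorphisms. Information about individuals and genotypes (strategy 1) is not modelled in this sketch.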