Project: Mining SNPs from EST data
SNPs are increasingly the markers of choice for genetic analysis.
However, the number of SNP markers currently available is limited. A
large number of putative SNPs is contained in publicly available
databases. This project aims to exploit this information by
developing and validating new software and algorithms for the
detection of reliable SNPs in public databases.
Identification and characterisation of genes
influencing a particular trait is another major bottleneck in
genetics and genomics. Two approaches are frequently used. The first
utilizes information from databases (sequences, expression data,
literature) to identify candidate genes: genes that are likely to
influence the phenotypic trait based on similarity to known genes,
or based on their expression profiles. The second approach involves
genetic studies with molecular markers that pinpoint areas on the
genome where candidate genes (or QTLs, if the traits are expressed
quantitatively) are located. These mapping data need to be linked to
whole genome sequences to identify putative candidate genes. This
project will combine both approaches to find the genes underlying
the trait of interest more rapidly and accurately. In addition, we
will validate the tools via positive identification of the
responsible gene(s) in Brassica.
The objectives of this project are:
1. To extend bioinformatics tools for the detection of reliable SNPs
based on public databases
2. To develop tools for predicting candidate genes, especially for
QTL trait(s)
3. To validate SNPs and predicted candidate genes, thereby validating
the bioinformatics tools
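As an illustration of how the two approaches could be combined, the following minimal sketch intersects genes that lie within a mapped QTL interval with genes supported by database evidence. It is not the project's tool: the `Gene` record, the function names, the coordinates and the gene names are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    # Hypothetical annotation record; coordinates are inclusive.
    name: str
    chromosome: str
    start: int
    end: int


def genes_in_interval(genes, chromosome, start, end):
    """Return genes that overlap the QTL interval on the given chromosome."""
    return [
        g for g in genes
        if g.chromosome == chromosome and g.start <= end and g.end >= start
    ]


def combine_evidence(positional, database_hits):
    """Keep positional candidates that also have database support."""
    return [g for g in positional if g.name in database_hits]


# Made-up annotation and QTL interval, for illustration only.
genes = [
    Gene("geneA", "C1", 1_000, 2_500),
    Gene("geneB", "C1", 9_000, 12_000),
    Gene("geneC", "C2", 1_500, 3_000),
    Gene("geneD", "C1", 11_500, 14_000),
]
positional = genes_in_interval(genes, "C1", 10_000, 13_000)
# Assumed set of genes flagged by sequence similarity or expression data.
shortlist = combine_evidence(positional, {"geneA", "geneD"})
print([g.name for g in shortlist])
```

Here the interval alone leaves two candidates, and the database evidence narrows them to one, which is the kind of reduction the combined approach is meant to achieve.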
Scientific approach and methodology
New software and algorithms for detection of reliable SNPs
Existing software (above) filters SNPs by controlling sequence
quality, requiring alignments of four or more member sequences, and
identifying SNPs in haplotypes.
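The redundancy-based part of such filtering can be sketched as follows. This is a simplified illustration, not the algorithm of any particular package: the function name `candidate_snps`, the representation of a contig as equal-length aligned strings, and the requirement that the minor allele occur at least twice are assumptions made for the example. Only the minimum of four member sequences comes from the text above.

```python
from collections import Counter


def candidate_snps(aligned_reads, min_depth=4, min_minor_count=2):
    """Return (position, allele_counts) for columns that look polymorphic.

    aligned_reads: strings from one contig, aligned to a common coordinate
    system; any character other than A, C, G or T counts as "no base".
    """
    results = []
    length = max(len(read) for read in aligned_reads)
    for pos in range(length):
        bases = [
            read[pos] for read in aligned_reads
            if pos < len(read) and read[pos] in "ACGT"
        ]
        if len(bases) < min_depth:
            continue  # too few member sequences cover this column
        counts = Counter(bases).most_common()
        # Require a second allele seen often enough to be unlikely
        # to be a single sequencing error (assumed threshold).
        if len(counts) >= 2 and counts[1][1] >= min_minor_count:
            results.append((pos, dict(counts)))
    return results


reads = [
    "ACGTAC",
    "ACGTAC",
    "ACATAC",
    "ACATAC",
    "ACGTAT",
]
snps = candidate_snps(reads)
print(snps)
```

In this toy alignment the column at index 2 is reported because both alleles are seen at least twice, while the single mismatch in the last column is treated as a probable sequencing error.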
In our approach the following strategies will be
used to filter SNPs:
1. Use available additional information about the data sources.
Important information includes, for instance, whether the data were
collected from different individuals or different genotypes, and
whether the individuals are heterozygous or homozygous.
2. Document and evaluate the quality of the sequence data. Trace
files and quality files for the sequences will be very helpful.
Quality thresholds can be set for the sequences as well as for the
contigs into which the sequences are assembled.
3. Identify true SNPs. For this, specific criteria need to be
formulated that enhance the reliability of the detected SNPs, for
instance a minimum number of similar sequences present in a contig,
and limited or no variation in the SNP flanking sequences (Useche FJ
et al. 2001). The overall genetic variability of the sequence is also
important information, as is the sequence identity (whether the gene
is part of a gene family or duplicated, which would significantly
hamper SNP genotyping).
4. Design of the pipeline and interface. The phred, phrap and consed
system has a command line / text file interface. It reports SNP
information for only one contig at a time, but is able to give
quality information about the SNPs. Besides, it is better suited to
discovering substitutions than insertion/deletion mutations
(Picoult-Newberg L et al. 1999). In contrast, a system like autoSNP
has a web interface. AutoSNP can show SNP information across multiple
contigs, but does not give quality information about SNPs. We will
design a system that combines the strengths of the different existing
software packages.
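The quality and reliability criteria in strategies 2 and 3 could be combined roughly as in the sketch below. It is an illustration under stated assumptions, not the project's pipeline: the function names, the per-base quality lists, and every default threshold (base quality 20, depth 4, minor allele count 2, flank of 2 positions) are hypothetical, and all reads are assumed to be padded to the same aligned length.

```python
from collections import Counter


def high_quality_bases(reads, quals, pos, min_qual):
    """Bases at one alignment column whose quality passes the threshold."""
    return [
        seq[pos] for seq, q in zip(reads, quals)
        if seq[pos] in "ACGT" and q[pos] >= min_qual
    ]


def is_variable(reads, quals, pos, min_qual):
    """True if more than one high-quality allele is seen at this column."""
    return len(set(high_quality_bases(reads, quals, pos, min_qual))) > 1


def reliable_snps(reads, quals, min_qual=20, min_depth=4,
                  min_minor_count=2, flank=2):
    """Return positions passing quality, depth and flanking criteria."""
    length = len(reads[0])
    snps = []
    for pos in range(length):
        bases = high_quality_bases(reads, quals, pos, min_qual)
        if len(bases) < min_depth:
            continue  # not enough reliable sequences in the contig
        counts = Counter(bases).most_common()
        if len(counts) < 2 or counts[1][1] < min_minor_count:
            continue  # no well-supported second allele
        neighbours = [
            p for p in range(pos - flank, pos + flank + 1)
            if p != pos and 0 <= p < length
        ]
        if any(is_variable(reads, quals, p, min_qual) for p in neighbours):
            continue  # variation in the flanking sequence
        snps.append(pos)
    return snps


reads = ["ACGTACGT", "ACGTACGT", "ACATACGT", "ACATACGT", "ACGTACTT"]
quals = [[30] * 8 for _ in reads]  # made-up uniform quality scores
print(reliable_snps(reads, quals))
```

Lowering the quality of the two `A` calls at index 2 below the threshold, or widening `flank` so that the variable column at index 6 falls inside the window, both cause the candidate to be rejected, which shows how the separate criteria interact.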