HaploSNPer Manual

1. Introduction to HaploSNPer

HaploSNPer is a flexible web-based tool for detecting haplotypes and single nucleotide polymorphisms (SNPs) in user-specified input sequences from both diploid and polyploid species. It includes BLAST for finding homologous sequences in the species-specific EST databases, CAP3 or PHRAP for aligning them, and QualitySNP for discovering reliable haplotypes and SNPs. All possible and reliable haplotypes (which in this context are the available alleles of a gene in the database) are detected by a mathematical algorithm using potential polymorphism information. Reliable SNPs are then identified based on the reconstructed haplotypes and sequence redundancy.

2. How to use the program

HaploSNPer is a user-friendly and flexible tool that can be run in two ways. In interactive mode, users submit sequences and wait for the job to finish. In batch mode, users submit sequences together with an email address; links to the results are emailed to them upon job completion, and a download link is supplied so that the results can be saved.

Moreover, the input options are flexible. Users can submit a sequence in FASTA format as the seed sequence, and sequences homologous to the seed will be retrieved from the selected database. At the same time, users can also submit a number of sequences that are similar or homologous to the seed sequence. This is especially useful when users want homologous sequences that are not yet available in a public database to be included in the detection of haplotypes and SNPs: these sequences and the seed sequence are merged with the homologous sequences from the selected database, and haplotypes and SNPs are detected on the merged dataset. Alternatively, HaploSNPer can process a dataset of homologous or similar sequences submitted by the user without a database being specified; in that case detection is executed on this dataset only.

The seed sequence is either entered in the seed sequence text area or uploaded from a file (the left part of the input options). Other similar sequences are either entered in the text area for other similar sequences or uploaded from a file (the right part of the input options). Importantly, all sequences must be in FASTA format (see below). Parameters can be set to tailor the performance and output of HaploSNPer to the specific requirements of the user.

1) Format of input sequences

The seed sequence and other similar sequences must be in FASTA format. Besides the nucleotide sequence, related information such as cultivar or strain, tissue and unigene ID can also be included. The header line of each sequence is structured as follows:

>sequence name|genotype|tissue|unigene ID

For example:

a. An example of the seed sequence from apple

>AY789239|Prima|leaf|unknown
ATGGGTGTCTACACATTTGAGAACGAGTACACCTCTGAGATTCCACCACCAAGATTGTTC
AAGGCCTTTGTCCTCGATGCTGATAACCTCATCCCCAAGATTGCACCCCAGGCAATCAAG
CATGCTGAGATCCTTGAAGGAGACGGTGGCCCTGGAACCATCAAGAAGATCACTTTTGGT
GAAGGCAGCCAATACGGCTACGTGAAGCACAAGATCGACTCAGTTGACGAAGCAAACTAC
TCATACGCCTACACTTTGATTGAAGGAGATGCTTTGACAGACACCATTGAGAAGGTCTCT
TACGAGACCAAGTTGGTGGCATCTGGAAGTGGTTCCATCATCAAGAGTATCAGCCACTAC
CACACCAAGGGTGATGTTGAGATCAAGGAAGAGCACGTCAAGGCTGGCAAAGAGAAGGCT
CATGGTTTGTTCAAGCTTATTGAGAGCTACCTCAAGGGCCACCCCGACGCATACAACTAA

b. An example of other sequences similar to the seed sequence

>AY789238|Fiesta|leaf|unknown
ATGGGTGTCTGCACATTTGAGAACGAGTTCACCTCTGAGATTCCACCATCAAGATTGTTC
AAGGCCTTTGTCCTTGATGCTGACAACCTCATCCCCAAGATTGCACCCCAGGCAATCAAG
CAAGCTGAAATCCTTGAAGGAAACGGTGGCCCCGGAACCATCAAGAAGATCACTTTTGGT
GAAGGCAGCCAGTACGGCTACGTGAAGCACAGGATTGACTCAATTGACGAAGCAAGCTAC
TCATACTCCTACACTTTGATTGAAGGAGATGCTTTGACAGACACCATCGAGAAAATATCT
TACGAGACCAAGTTGGTGGCATGTGGAAGTGGTTCCACCATCAAGAGCATCAGTCATTAC
CACACCAAGGGAAACATTGAAATCAAGGAAGAGCACGTCAAGGCTGGAAAAGAGAAGGCC
CATGGTTTGTTCAAACTTATTGAGAGCTACCTTAAGGACCACCCCGACGCATACAACTAA
>AY789240|Prima|leaf|unknown
ATGGGTGTCTACACATTTGAGAACGAGTACACCTCTGAGATTCCACCACCAAGATTGTTC
AAGGCCTTTGTCCTCGATGCTGATAACCTCATCCCCAAGATTGCACCCCAGGCAATCAAG
CATGCTGAGATCCTTGAAGGAGACGGTGGCCCTGGAACCATCAAAAAGATCACTTTTGGT
GAAGGTAGCCAATACGGCTACGTGAAGCACAAGATCGACTCGGTTGACGAGGCAAACTAC
TCATACGCCTACACTTTGATTGAAGGAGATGCTTTGACAGACACCATTGAGAAGGTCTCT
TACGAGACCAAGTTGGTGGCATCTGGAAGTGGTTCCATCATCAAGAGTATCAGCCACTAC
CACACCAAGGGTGATGTTGAGATCAAGGAAGAGCACGTCAAGGCTGGCAAAGAGAAGGCT
CATGGTTTGTTCAAGCTTATTGAGAGCTACCTTAAGGGCCACCCCGACGCATACAACTAA

Note:

1. If you do not know the genotype, tissue and unigene ID of your sequences, you can use only the sequence name or accession. For instance: >seq112233

2. If the sequence name, genotype or tissue description contains spaces, replace them with "-". For example: >AB1001|apple-tree|young-leaf

3. The seed sequence and other similar sequences can also be uploaded from plain text files once they are in the proper format.

4. The annotation information (cultivar/strain, genotype, tissue and unigene) is retrieved from the EMBL database using the accession of a sequence. This means that only sequences that exist in the public EMBL or NCBI databases can have annotation information.

5. The number of similar or homologous sequences submitted through the "Other similar sequences" option is limited to 500. If you want to process more than 500 sequences, please contact us by email: webmaster@bioinformatics.nl.
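
The header convention described above can be illustrated with a minimal Python sketch. The function names `parse_header` and `to_field` are hypothetical, and filling missing fields with "unknown" is an assumption that mirrors the apple examples; none of this is part of HaploSNPer itself.

```python
def parse_header(line):
    """Split '>name|genotype|tissue|unigene ID' into a dict.

    Missing trailing fields are filled with 'unknown' (an assumption
    made for this sketch, mirroring the apple examples).
    """
    if not line.startswith(">"):
        raise ValueError("a FASTA header must start with '>'")
    fields = [f.strip() for f in line[1:].strip().split("|")]
    if len(fields) > 4 or not fields[0]:
        raise ValueError("expected a non-empty name and at most 4 fields")
    fields += ["unknown"] * (4 - len(fields))
    keys = ("name", "genotype", "tissue", "unigene")
    return dict(zip(keys, fields))


def to_field(text):
    """Replace runs of spaces with '-', as note 2 requires."""
    return "-".join(text.split())


print(parse_header(">AY789239|Prima|leaf|unknown"))
print(parse_header(">seq112233"))
print(to_field("young leaf"))
```

A header with only a sequence name, as allowed by note 1, is parsed into a record whose remaining fields are "unknown".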

The input interface of HaploSNPer is as follows:

2) Settings of HaploSNPer's Parameters

Eight groups of parameters control the performance of HaploSNPer: a) selecting a tagging database, b) selecting a sequence alignment program, c) pre-processing of sequences, d) settings for BLAST and CAP3 or PHRAP, e) settings for haplotype reconstruction, f) settings for the low-quality region of sequences, g) settings for SNP detection, and h) settings for visualizing output. Of these, e, f and g control the performance of QualitySNP.

a. Selecting a tagging database

The tagging database can be selected from the list of species-specific EST databases. Each database contains all EST sequences of the species extracted from the EMBL database; the number of sequences in each database and the EMBL version can be checked via the "data info" link. Moreover, these databases store not only nucleotide information but also cultivar or strain, tissue and unigene ID where these are available in the EMBL database. HaploSNPer uses this information when displaying haplotypes and SNPs.

b. Selecting a sequence alignment program

CAP3 or PHRAP can be chosen for sequence alignment. For SNP mining, CAP3 is recommended because it uses individual sequence overlaps for cluster construction, whereas PHRAP tends to extend the consensus sequence by overlap. However, PHRAP is much faster than CAP3.

c. Pre-processing of sequences

Many sequences obtained from public databases are of poor quality and contain vector fragments, which can lead to incorrect sequence assembly and SNP detection. Cross_match is a good tool for removing vector sequences, using the vector file downloaded from NCBI. In addition, some sequences contain long repeat fragments, which can also cause incorrect sequence assembly; RepeatMasker is good at masking such fragments. The repeat database was downloaded from http://www.girinst.org/. To handle these cases, HaploSNPer offers Cross_match and RepeatMasker for cleaning sequences. Cross_match and RepeatMasker usually take a long time to process sequences, so when either of them is selected, an email address is required and the output is sent to that address instead of being shown on a web page.

The vector file and repeat database will be updated as new versions are released.

d. Settings of parameters for BLAST and CAP3 or PHRAP

BLAST is used to search for sequences homologous or similar to the input seed sequence. The BLAST E-value threshold determines which sequences are selected as similar to the seed sequence. When the E-value is high, many similar sequences will be found, which will produce many clusters; when the E-value is low, there may not be enough similar sequences for haplotype and SNP detection. A series of E-values has been tested on several sequences; an E-value of 1e-60 usually selects sequences sufficiently similar to the seed sequence, and this value is the default in HaploSNPer. Users can adjust it according to the results obtained with the default setting. For example, if no homologous sequences are found, users can increase the E-value to find more similar sequences; if too many clusters are produced, users can decrease the E-value so that fewer sequences are selected.
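
The effect of the E-value threshold can be sketched in a few lines of Python. This snippet does not call BLAST; the hit names and E-values are invented for illustration, and `select_hits` is a hypothetical helper. Only the default of 1e-60 comes from the text.

```python
def select_hits(hits, max_evalue=1e-60):
    """Keep the names of hits whose E-value is at or below the threshold."""
    return [name for name, evalue in hits if evalue <= max_evalue]


# Invented (name, E-value) pairs standing in for BLAST results.
hits = [("est_A", 1e-180), ("est_B", 1e-75), ("est_C", 1e-40)]

print(select_hits(hits))          # default threshold of 1e-60
print(select_hits(hits, 1e-30))   # higher E-value: more sequences selected
print(select_hits(hits, 1e-100))  # lower E-value: fewer sequences selected
```

Raising the threshold admits more distant sequences (and potentially more clusters), while lowering it keeps only the closest matches.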

The similarity threshold used by CAP3 or PHRAP for sequence assembly can be set by users. In our experience, a similarity of 95% for CAP3 and PHRAP is stringent enough to exclude most paralogous sequences while keeping all available allelic sequences in a cluster.

e. Settings for haplotype reconstruction

In the QualitySNP algorithm, a potential haplotype is defined as a group of sequences within a cluster that have the same nucleotide at every polymorphic site. For haplotype reconstruction, the similarity between a candidate sequence and a haplotype group at each single SNP is calculated and compared with a threshold to determine whether the candidate sequence matches the haplotype group at that SNP; then the similarity over all SNPs is compared with a second threshold to determine whether the candidate sequence can be reliably assigned to the haplotype group. By using the similarity per polymorphic site as well as the similarity over all polymorphic sites, haplotypes can be reconstructed reliably even from sequences that contain sequencing errors.

Thresholds for similarity per polymorphic site and similarity over all polymorphic sites can be set according to the (suspected) similarity between a gene and its paralogs, and the similarity between its alleles. Settings that are too low may result in several alleles or even paralogous sequences being classified as a single haplotype, while settings that are too high would result in the separation of allelic sequences into different haplotypes. After a series of tests on the potato data set, a threshold of 75% for similarity per polymorphic site and of 80% for similarity over all polymorphic sites were found to be optimal.
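
The two-threshold test can be sketched as follows. This is one plausible reading of the description, not the actual QualitySNP implementation: here the per-site similarity is assumed to be the fraction of group members that share the candidate's nucleotide at that site, and the overall similarity is assumed to be the fraction of sites that pass the per-site test. The function names are hypothetical; the 75% and 80% values are those quoted above.

```python
def site_similarity(candidate_base, group_bases):
    """Fraction of group members that carry the candidate's base at one site."""
    return sum(base == candidate_base for base in group_bases) / len(group_bases)


def matches_group(candidate, group, per_site=0.75, overall=0.80):
    """Decide whether a candidate can be assigned to a haplotype group.

    candidate: bases of the candidate at the polymorphic sites, e.g. "ACGTA".
    group: list of such strings, one per sequence already in the group.
    """
    if not candidate or not group:
        raise ValueError("candidate and group must be non-empty")
    matched_sites = 0
    for i, base in enumerate(candidate):
        column = [member[i] for member in group]
        if site_similarity(base, column) >= per_site:
            matched_sites += 1
    return matched_sites / len(candidate) >= overall


group = ["ACGTA", "ACGTA", "ACGTA", "ACGTC"]
print(matches_group("ACGTA", group))  # identical at every site
print(matches_group("ACGAA", group))  # one mismatch, e.g. a sequencing error
print(matches_group("TTGAA", group))  # differs at most sites
```

Under these assumptions a candidate with a single mismatching site out of five is still assigned to the group, which is the kind of tolerance to sequencing errors the text describes.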

f. Settings for low quality region

One of the criteria for identifying reliable SNPs is the confidence score, which is calculated from sequence redundancy in the high-quality and low-quality regions. The extent of the low-quality regions at the 5' and 3' ends of the sequences can be specified. Based on examination of public EST sequences, we found that the 5' low-quality region is around 30 nucleotides, while the 3' low-quality region is about 20% of the sequence length; these values are set as defaults in HaploSNPer. The weight of the low-quality region can be set based on the quality of the nucleotides in this region. In our study, we found that nucleotides in the low-quality region have only 0.5 of the reliability of those in the high-quality region. This means that a nucleotide in the low-quality region is considered reliable only if at least two sequences have the same nucleotide at that locus. The default weight for the low-quality region is 0.5. Users can change this value; for instance, to ignore the low-quality region of sequences altogether, set the weight to 0.
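
As a rough illustration of these defaults, the sketch below assigns a weight to each position of a sequence. The function name `quality_weight`, the 0-based positions, and the integer rounding of the 3' region are assumptions; the manual does not state how HaploSNPer treats boundary positions.

```python
def quality_weight(pos, seq_len, five_prime=30, three_prime_pct=20, lq_weight=0.5):
    """Return the weight of a 0-based position within a sequence.

    Positions in the first `five_prime` nucleotides or in the last
    `three_prime_pct` percent of the sequence count as low quality.
    """
    three_prime_len = seq_len * three_prime_pct // 100  # integer arithmetic
    in_five_prime = pos < five_prime
    in_three_prime = pos >= seq_len - three_prime_len
    return lq_weight if in_five_prime or in_three_prime else 1.0


# A 480-nt sequence, the length of the apple examples above.
for pos in (10, 200, 400):
    print(pos, quality_weight(pos, 480))
```

With a weight of 0.5, two low-quality observations of the same nucleotide add up to the weight of one high-quality observation; passing `lq_weight=0` makes low-quality positions contribute nothing.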

g. Settings for SNP detection

In our program, sequence redundancy is also used to guard against sequencing errors. The default minimum cluster size is 4, and the default minimum allele size is 2. The default minimum confidence score is 2. The higher the confidence score, the more reliable the SNP is in terms of sequence redundancy.

Based on the high-quality (HQ) and low-quality (LQ) regions, a confidence score is calculated for each allele according to the rules below; the SNP confidence score is then the smallest of its allele confidence scores. The rules for the allele confidence score are: 5 if the allele occurs in more than one HQ; 4 if in one HQ and at least two LQ; 3 if in more than 3 LQ; 2 if in one HQ and one LQ, or in 3 LQ; 1 if in 2 LQ; otherwise 0. In our study, we assigned a high confidence score (HCS) to SNPs with a confidence score of at least 2. This threshold can be adjusted by users according to their specific requirements.
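
The score rules can be written down as a short Python sketch. It follows the rules literally as listed, so an allele seen in exactly one HQ region and no LQ region falls under "otherwise" and scores 0; whether QualitySNP treats that case the same way is not stated here. The function names and the example counts are hypothetical.

```python
def allele_score(hq, lq):
    """Confidence score of one allele from its HQ and LQ occurrence counts."""
    if hq > 1:
        return 5
    if hq == 1 and lq >= 2:
        return 4
    if lq > 3:                       # only reachable with hq == 0
        return 3
    if (hq == 1 and lq == 1) or lq == 3:
        return 2
    if lq == 2:
        return 1
    return 0


def snp_score(alleles):
    """SNP confidence score: the smallest allele score.

    alleles maps each nucleotide to a (hq_count, lq_count) pair.
    """
    return min(allele_score(hq, lq) for hq, lq in alleles.values())


example = {"A": (2, 0), "G": (1, 1)}  # invented counts
score = snp_score(example)
print(score, "HCS" if score >= 2 else "low confidence")
```

In the invented example, allele A scores 5 and allele G scores 2, so the SNP scores 2 and just meets the default threshold.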

h. Settings for output

The output of HaploSNPer can be displayed in two ways: the user can choose to view the cluster with the seed sequence, or all related clusters.

An email address is required if users want the output sent to an email address instead of waiting for a web page. Alternatively, users can run a job directly, in which case it proceeds interactively. When the total number of homologous sequences (sequences submitted by the user plus sequences obtained from the selected database) exceeds the threshold (100 sequences for human and mouse, and 200 for other species), an email address is required for returning the results, since processing may take a long time. When the number of sequences does not exceed the threshold, HaploSNPer shows the results directly in a new web page.
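
The threshold rule amounts to a simple comparison, sketched below. The function name `needs_email` and the species strings are hypothetical; the thresholds are those given above, and the sequence counts are taken from the apple example in section 3) of this manual.

```python
def needs_email(n_submitted, n_from_database, species):
    """Return True when the job is too large to be shown interactively."""
    threshold = 100 if species.lower() in {"human", "mouse"} else 200
    return n_submitted + n_from_database > threshold


# 1 seed + 4 similar sequences submitted, 294 retrieved from the database.
print(needs_email(5, 294, "apple"))
# Same input, but only 193 sequences retrieved with a stricter E-value.
print(needs_email(5, 193, "apple"))
```

A total equal to the threshold is treated as interactive here, matching the wording "more than the threshold".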

Note:

All these parameters can be set according to the user's requirements. Stringent settings will increase the number of reliable haplotypes, and some false-positive SNPs will probably be included. Loose settings will decrease the number of reliable SNPs, and some true SNPs will be missed.

Suggestion: Run HaploSNPer with the default settings first, and then adjust the parameters based on the results and your experience.

3) The process of the program

After clicking the "submit" button, the program will process the input sequences based on these settings.

To illustrate the performance of HaploSNPer, sequences from the Mal d 1 genes of apple (Gao et al., 2005; TAG) were used as input. In this case the input consisted of a seed sequence and four homologous sequences, and all parameters were kept at their defaults. After the job was submitted, the processing page was shown as follows:

The processing page shows the database and parameters set by the user. In this example, 294 homologous sequences were obtained from the apple EST database; together with the 4 similar sequences from the input, this came to more than 200 sequences in total. An email address is therefore required so that the results can be sent back when SNP detection is finished, since processing may take a long time. Users enter their email address and wait for the email, which contains links to the results. The results can be read by clicking those links.

If users want to get the results directly, they can try lowering the E-value threshold. For example, the E-value was set to e-180, on the expectation that fewer than 200 sequences would be retrieved, while the other parameters were kept at their defaults. The processing page is as follows:

With the E-value threshold of e-180, 193 sequences were obtained from the database. These were merged with the sequences submitted by the user (4 other similar sequences and 1 seed sequence), and the merged set was used to detect SNPs. Two buttons were shown on the page: one for showing the results and the other for downloading them.

Note:

The results are kept on our server for two days.

3. How to use the results

The results include three parts: the first part shows parameters set for this process; the second shows the summary of haplotypes, SNPs and D-value of all related clusters; the third part contains the haplotype and SNPs information of each cluster.

1) The first part of the results is as follows:

2) The second part of the results

3) The third part of the results

If the output option "only the cluster with seed sequence" is selected, only the cluster with the seed sequence is shown; if "All clusters related to the seed" is chosen, all clusters with reliable SNPs are shown one by one. The information for each cluster consists of three parts (A, B and C). The A part gives all information about the potential SNPs and the reliability of the SNPs in the cluster. The B part, reached by clicking the link "show sequence alignment information", shows the haplotype reconstruction based on potential SNPs together with the sequence alignment. The C part shows the reliable haplotypes and reliable SNPs. In this case, two clusters were produced; the cluster with the seed is used here as an example. The A part of the cluster follows:

When the link "Show sequence alignment information" is clicked, the sequence alignment and haplotype reconstruction (B part) are shown in a new window as follows:

In this window, when the mouse is over the link of a sequence accession, such as EB177853, the annotation information (cultivar, tissue and unigene ID) of the sequence is shown.

The C part of the results focuses on showing reliable SNPs and reliable haplotypes, as shown below.

In the A part, there were 19 potential SNPs defined in 5 reliable haplotypes, of which 11 were identified as reliable SNPs. In the B part, haplotypes 1 to 5 each consist of at least two sequences and were considered reliable haplotypes. In contrast, the last haplotype, which consisted of only one sequence, was considered unreliable because the quality of a single sequence cannot be trusted. Such haplotypes are not used to detect reliable SNPs. The C part shows the relationship between reliable haplotypes and reliable SNPs.