Bioinformatics

Molecular biology in the internet

Main page

Appointments

Bioinformatics

Literature

Exercises

Tasks

Databases

Software

Sequence comparisons

Homology searches

Motif searches

Hidden Markov models

Hydrophobicity analyses

Topology and helix packing

Protein localization

Secondary structure

Super-secondary structure

3D structure

 

    Primer to database searches:

    Although there are several different comparison programs available (e.g., BLASTP, FASTA, SSEARCH, and BLITZ) that can be used with different scoring systems (e.g., PAM120, PAM250, BLOSUM50, BLOSUM62) and different databases (e.g., PIR, SWISS-PROT, GenPept), the following search protocol should identify homologous sequences whenever they can be found. 1. Always compare protein sequences if the genes encode proteins. Protein sequence comparison will typically double the evolutionary lookback time over DNA sequence comparison. 2. Search several sequence databases using a rapid sequence comparison program (e.g., BLASTP or FASTA, ktup = 2). Well-curated databases like PIR or SWISS-PROT tend to have fewer redundant sequences, which improves the statistical significance of a match, but they are less comprehensive and up-to-date than GenPept. 3. If there is good agreement between the distribution of scores and the theoretical distribution, and the alignments do not include "simple sequence" domains, accept sequences with FASTA E() values or BLASTP P() values below 0.02 as homologous. 4. If no library sequences are found with E values below 0.02, perform additional searches with FASTA, ktup = 1, or SSEARCH. If library sequences with E values less than 0.02 are found, the sequences are probably homologous, unless a low-complexity domain is aligned. However, sequences with similarity scores from 0.02 to 10.0 may be homologous as well. To characterize these more distantly related sequences, select "marginal" library sequences and use them to search the databases. Additional family members should have E values less than 0.05. 5. Homologous sequences share a common ancestor, and thus a common protein fold. Depending on the evolutionary distance and divergence path, two or more homologous sequences may have very few absolutely conserved residues. However, if homology has been inferred between A and B, between B and C, and between C and D, A and D must be homologous, even if they share no significant similarity. 6. Sequences with marginal E values should also be tested using the PRSS program. Compare the query and library sequences using at least 200 (and preferably 1000) shuffles. Shuffles using a window (-w) of 10-20 are more stringent than a uniform shuffle. Use the E value after 1000 shuffles to confirm an inference of homology. 7. Homologous sequences are usually similar over an entire sequence or domain, typically sharing 20-25% or greater identity for more than 200 residues. Matches that are more than 50% identical in a 20- to 40-amino acid region occur frequently by chance and do not indicate homology. By following these steps, one will very rarely assert that two sequences are homologous when in fact they are not. However, these criteria are stringent; distantly related homologous sequences may fail to be detected because their similarity is not statistically significant. These tests are biased toward missing some distantly related sequences to avoid the possibility of misidentifying unrelated ones. In most database searches, the ratio of related to unrelated sequences is more than 4000:1 (e.g., 10 related and 40,000 unrelated sequences). Thus, one is more likely to mistakenly identify two sequences as related than to overlook a genuine relationship, and our conservative evaluation criteria reflect that bias. (Pearson, 1996)

    The FASTA3 and FASTA2 packages provide a flexible set of sequence-comparison programs that are particularly valuable because of their accurate statistical estimates and high-quality alignments. Traditionally, sequence similarity searches have sought to ask one question: "Is my query sequence homologous to anything in the database?" Both FASTA and BLAST can provide reliable answers to this question with their statistical estimates; if the expectation value E is < 0.001-0.01 and you are not doing hundreds of searches a day, the answer is probably yes. In general, the most effective search strategies follow these rules: 1. Whenever possible, compare at the amino acid level, rather than the nucleotide level. Search first with protein sequences (blastp, fasta3, and ssearch3), then with translated DNA sequences (fastx, blastx), and only at the DNA level as a last resort (Table 5). 2. Search the smallest database that is likely to contain the sequence of interest (but it must contain many unrelated sequences for accurate statistical estimates). 3. Use sequence statistics, rather than percent identity or percent similarity, as your primary criterion for sequence homology. 4. Check that the statistics are likely to be accurate by looking for the highest-scoring unrelated sequence, using prss3 to confirm the expectation, and searching with shuffled copies of the query sequence [randseq, searches with shuffled sequences should have E approx 1.0]. 5. Consider searches with different gap penalties and other scoring matrices. Searches with long query sequences against full-length sequence libraries will not change dramatically when BLOSUM62 is used instead of BLOSUM50 (20), or a gap penalty of -14/-2 is used in place of -12/-2. However, shallower or more stringent scoring matrices are more effective at uncovering relationships in partial sequences (3,18), and they can be used to sharpen dramatically the scope of the similarity search. However, as illustrated in the last section, the E value is only the first step in characterizing a sequence relationship. Once one has confidence that the sequences are homologous, one should look at the sequence alignments and percent identities, particularly when searching with lower quality sequences. When sequence alignments are very short, the alignment should become more significant when a shallower scoring matrix is used, e.g., BLOSUM62 rather than BLOSUM50 (remember to change the gap penalties). Homology can be reliably inferred from statistically significant similarity. Whereas homology implies common three-dimensional structure, homology need not imply common function. Orthologous sequences usually have similar functions, but paralogous sequences often acquire very different functional roles. Motif databases, such as PROSITE (21), can provide evidence for the conservation of critical functional residues. However, motif identity in the absence of overall sequence similarity is not a reliable indicator of homology. (Pearson, 2000)
    Sequence formats:

  • Example of a SwissProt entry in FASTA format
    Similarity matrices:

  • PAM250: Percent Accepted Mutation-Matrix (Dayhoff et al., 1978)
  • BLOSUM62: Blocks Substitution-Matrix (Henikoff and Henikoff, 1992)
  • More Similarity matrices
    • HELP to proper use of similarity matrices
    Sequence examples:

  • Database of amino acid sequences via Entrez
  • Database of nucleotide sequences via Entrez

  • Bacteriorhodopsin from Halobacterium salinarium: Seven-helix bundle protein
  • TonB from Escherichia coli: Protein with N-terminal transmembrane helix
  • Maltose-binding protein from Escherichia coli: Protein with N-terminal signal sequence for secretion into the periplasmic space
  • OmpA from Escherichia coli: Two-domain protein: the N-terminal protein domain is embedded into the outer membrane in form of an 8-stranded β barrel while the C-terminal protein domain is found in the periplasmic space
    Abbreviations:

  • BALSA: Bayesian Algorithm for Local Sequence Alignment
  • BEAUTY: BLAST Enhanced Alignment Utility
  • BLAST: Basic Local Alignment Search Tool
  • BLOSUM: Blocks Substitution-Matrix
  • EASY: Expert Analysis SYstem
  • PAM: Percent Accepted Mutation
  • PHI-BLAST: Pattern-Hit Initiated BLAST
  • PSI-BLAST: Position-Specific Iterated BLAST
  • RPS-BLAST: Reverse Position Specific BLAST
  • WU: Washington University

  • BLASTN: Nucleotide sequence versus nucleotide sequence database
  • BLASTP: Amino acid sequence versus amino acid sequence database
  • BLASTX: Translated nucleotide sequence versus amino acid sequence database
  • TBLASTN: Amino acid sequence versus translated nucleotide sequence database
  • TBLASTX: Translated nucleotide sequence versus translated nucleotide sequence database

 

Latest update: November 19, 2009


Ralf Koebnik
Institut de recherche pour le dèveloppement
UMR 5096, CNRS-UP-IRD
911, Avenue Agropolis, BP 64501
34394 Montpellier, Cedex 5
FRANCE
Phone: +33 (0)4 67 41 62 28
Fax: +33 (0)4 67 41 61 81
Email: koebnik(at)gmx.de
Please replace (at) by @.


Home Back to main page