Bioinformatics

Molecular biology on the internet


    Motif searches in sequence databases:

    Several databases provide information about conserved regions within proteins. These databases can be queried with one's own sequences or sequence blocks. The following websites give access to several motif search algorithms:

  • Overview of databases of protein sequence motifs
  • Overview of databases of protein domains
  • Overview of databases of individual protein families

  • The PROSITE Database of Protein Families and Domains
  • The MEME/MAST System:
    MEME (Multiple EM for Motif Elicitation) is a tool for discovering motifs in a group of related DNA or protein sequences.
    MAST (Multiple Alignment and Search Tool) is a tool for searching biological sequence databases for sequences that contain one or more of a group of known motifs.
  • The Blocks Database
    • Search for a database entry
    • Blocks Searcher: Protein sequence versus Blocks
    • Reverse PSI-BLAST: Protein sequence versus Blocks
    • IMPALA (Integrating Matrix Profiles And Local Alignments): Protein sequence versus Blocks
    • LAMA (Local Alignment of Multiple Alignments): Sequence block (multiple alignment) versus Blocks
    • CODEHOP (COnsensus-DEgenerate Hybrid Oligonucleotide Primers): Design of oligonucleotides for hybridization or PCR experiments
  • The PRINTS Protein motif fingerprint database
  • The Pfam Protein Family database at the Sanger Center or Pfam at WUStL
  • The ProDom Protein Domain database
  • The DOMO Database of homologous protein Domain families
  • The eMOTIF Protein sequence Motif determination and searches
  • The PROF_PAT Database of protein family Patterns
  • The SBASE Protein domain library
  • The SYSTERS (SYSTEmatic Re-Searching) protein sequence cluster set

  • The iProClass Integrated Protein Classification database
  • The InterPro Integrated resource of Protein domains and functional sites
  • The MetaFam Unified classification of protein Families (OUTDATED)
    PROSITE - Database of Protein Families and Domains

    PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns (motifs) and profiles (weight matrices) that help to reliably identify to which known protein family (if any) a new sequence belongs. It is based on the observation that, while there is a huge number of different proteins, most of them can be grouped, on the basis of similarities in their sequences, into a limited number of families. Proteins or protein domains belonging to a particular family generally share functional attributes and are derived from a common ancestor. It is apparent, when studying protein sequence families, that some regions have been better conserved than others during evolution. These regions are generally important for the function of a protein and/or for the maintenance of its three-dimensional structure. By analyzing the constant and variable properties of such groups of similar sequences, it is possible to derive a signature for a protein family or domain, which distinguishes its members from all other unrelated proteins. A pertinent analogy is the use of fingerprints by the police for identification purposes. A fingerprint is generally sufficient to identify a given individual. Similarly, a protein signature can be used to assign a newly sequenced protein to a specific family of proteins and thus to formulate hypotheses about its function. PROSITE currently contains signatures specific for about a thousand protein families or domains. Each of these signatures comes with documentation providing background information on the structure and function of these proteins.
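
    To make the idea of a pattern-type signature concrete, the following minimal sketch translates a PROSITE-style pattern into a Python regular expression and scans a sequence with it. It assumes the commonly documented pattern syntax (elements separated by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for excluded residues, "(n)" or "(n,m)" for repeats, "<" and ">" as terminus anchors). The function name prosite_to_regex and the example sequence are illustrative assumptions, not part of PROSITE or its tools, and rarer syntax such as ">" inside square brackets is not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper for illustration; handles only the common syntax.
    """
    pattern = pattern.rstrip(".")  # patterns are conventionally ended by a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchor at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchor at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (2) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, repeat = m.group(1), m.group(2)
        if core == "x":  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # excluded residues
            core = "[^" + core[1:-1] + "]"
        parts.append(prefix + core + ("{" + repeat + "}" if repeat else "") + suffix)
    return "".join(parts)


# Example: a P-loop style pattern scanned against a short illustrative sequence.
regex = prosite_to_regex("[AG]-x(4)-G-K-[ST].")
match = re.search(regex, "MTEYKLVVVGAGGVGKSALTIQ")
if match:
    print(regex, match.start(), match.group())
```

    A match found this way only indicates that the sequence is compatible with the pattern; profiles (weight matrices) score positions quantitatively and are not covered by this sketch.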
    The Blocks database

    The blocks for the Blocks Database are made automatically by looking for the most highly conserved regions in groups of proteins documented in the PROSITE Database. The PROSITE pattern for a protein group is not used in any way to make the Blocks Database and the pattern may or may not be contained in one of the blocks representing a group. These blocks are then calibrated against the SWISS-PROT database to obtain a measure of the chance distribution of matches. It is these calibrated blocks that make up the Blocks Database. The WWW versions of the PROSITE and SWISS-PROT Databases that are used on this server are located at the ExPASy World Wide Web (WWW) Molecular Biology Server of the Geneva University Hospital and the University of Geneva.
    The blocks created by Block Maker are created in the same manner as the blocks in the Blocks Database but with sequences provided by the user. Results are reported in a multiple sequence alignment format without calibration and in the standard Block format for searching.
    Blocks Searcher: introduction

    As an aid to detection and verification of protein sequence homology, the Blocks Searcher compares a protein or DNA sequence to the current database of protein blocks. Blocks are short multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins.
    The rationale behind searching a database of blocks is that information from multiply aligned sequences is present in a concentrated form, reducing background and increasing sensitivity to distant relationships. This information is represented in a position-specific scoring table or "profile" (4), in which each column of the alignment is converted to a column of a table representing the frequency of occurrence of each of the 20 amino acids. For searching a database of blocks, the first position of the sequence is aligned with the first position of the first block, and a score for that amino acid is obtained from the profile column corresponding to that position. Scores are summed over the width of the alignment, and then the block is aligned with the next position of the sequence. This procedure is carried out exhaustively for all positions of the sequence and for all blocks in the database, and the best alignments between the sequence and entries in the Blocks database are noted. If a particular block scores highly, it is possible that the sequence is related to the group of sequences the block represents. Typically, a group of proteins has more than one region in common, and their relationship is represented as a series of blocks separated by unaligned regions. If a second block for a group also scores highly in the search, the evidence that the sequence is related to the group is strengthened, and it is further strengthened if a third block also scores highly, and so on.
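
    The sliding procedure described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the profile holds raw column frequencies (real block searches use weighted, calibrated scores), and the names build_profile and best_alignment as well as the toy block are hypothetical, not taken from the Blocks software.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(block):
    """Turn an ungapped block (equal-length segments) into per-column frequencies."""
    width = len(block[0])
    if any(len(segment) != width for segment in block):
        raise ValueError("all segments in a block must have the same length")
    profile = []
    for col in range(width):
        counts = Counter(segment[col] for segment in block)
        profile.append({aa: counts.get(aa, 0) / len(block) for aa in AMINO_ACIDS})
    return profile


def best_alignment(sequence, profile):
    """Slide the profile along the sequence; return (best_score, start_offset).

    Returns (-inf, -1) if the sequence is shorter than the block.
    """
    width = len(profile)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        # Sum the column scores over the width of the block at this offset.
        score = sum(profile[i].get(sequence[start + i], 0.0) for i in range(width))
        if score > best[0]:
            best = (score, start)
    return best


# Toy block of three aligned segments (made up for illustration).
block = ["GKST", "GKTT", "GKSS"]
profile = build_profile(block)
score, offset = best_alignment("MAAGKSTLLA", profile)
print(f"best score {score:.2f} at offset {offset}")
```

    Searching a whole database amounts to repeating best_alignment for every block and keeping the top-scoring hits; several high-scoring blocks from the same group then reinforce one another as described above.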
    The Prints Protein Motif Fingerprint Database in Blocks Format

    The Blocks WWW Server optionally searches a complete version of Terri Attwood's PRINTS Database in Blocks format using the Blimps searching program. Although this service is not available from the Blocks email server, a subset of PRINTS blocks not represented in the Blocks Database is always searched. Because PRINTS includes blocks from more than 300 families not represented in the Blocks Database, and because different methods are used to construct blocks for families represented in both databases, we recommend searching both databases.
    Blocks from Pfam protein groups

    The Blocks WWW Server always searches a subset of blocks made in the same way as the Blocks Database from groups of sequences listed in the Pfam Database. While each entry in the PROSITE and PRINTS Databases corresponds to a family of proteins which may share multiple conserved domains, each Pfam entry corresponds to a single conserved domain. We take the SWISS-PROT (but not TrEMBL) protein sequences listed for a Pfam domain and make blocks, usually obtaining multiple blocks and usually (but not always) one of the blocks corresponds to the Pfam domain. We then use LAMA to compare the resulting blocks with the Blocks and PRINTS Databases and with previously saved blocks from Pfam and only save those that represent new regions. In this way we try to expand the set of conserved domains searched.
    Blocks from ProDom protein groups

    The Blocks WWW Server always searches a subset of blocks made in the same way as the Blocks Database but from groups of sequences listed in the ProDom Database. While each entry in the PROSITE and PRINTS Databases corresponds to a family of proteins which may share multiple conserved domains, each ProDom entry corresponds to a single conserved domain. We take the SWISS-PROT protein sequences listed for a ProDom domain and make blocks, usually obtaining multiple blocks and usually (but not always) one of the blocks corresponds to the ProDom domain. We then use LAMA to compare the resulting blocks with the Blocks and PRINTS Databases and with previously saved blocks from ProDom and only save those that represent new regions. In this way we try to expand the set of conserved domains searched.
    Blocks from Domo protein groups

    The Blocks WWW Server always searches a subset of blocks made in the same way as the Blocks Database from groups of sequences listed in the Domo Database. While each entry in the PROSITE and PRINTS Databases corresponds to a family of proteins which may share multiple conserved domains, each Domo entry corresponds to a single conserved domain. We take the SWISS-PROT (but not PIR) protein sequences listed for a Domo domain and make blocks, usually obtaining multiple blocks and usually (but not always) one of the blocks corresponds to the Domo domain. We then use LAMA to compare the resulting blocks with the Blocks and PRINTS Databases and with previously saved blocks from Domo and only save those that represent new regions. In this way we try to expand the set of conserved domains searched.
    PRINTS - Protein Motif Fingerprint Database

    PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of a SWISS-PROT / TrEMBL composite. Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs, full diagnostic potency deriving from the mutual context provided by motif neighbours.
    Pfam - Protein Family Database

    Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains. Version 6.2 of Pfam (April 2001) contains alignments and models for 2773 protein families, based on the Swissprot 39 and SP-TrEMBL 14 protein sequence databases.
    ProDom - Protein Domain Database

    July 1998 (ProDom 35)

    The ProDom protein domain database consists of an automatic compilation of homologous domains. Current versions of ProDom are built using a novel procedure based on recursive PSI-BLAST searches (Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W & Lipman DJ, 1997, Nucleic Acids Res., 25:3389-3402; Gouzy J., Corpet F. & Kahn D., 1999, Computers and Chemistry 23:333-340). Large families are processed much better with this new procedure than with the former DOMAINER program (Sonnhammer, E.L.L. & Kahn, D., 1994, Protein Sci., 3:482-492).

    March 2001 (ProDom 2001.1)

    390 ProDom families were generated automatically using PSI-BLAST with a profile built from the seed alignments of Pfam-A 4.3 families.
    DOMO - Database of Homologous Protein Domain Families

    DOMO is a database of homologous protein domain families. It was obtained from successive sequence analysis steps including similarity search, domain delineation, multiple sequence alignment and motif construction. A total of 83054 non-redundant protein sequences from SWISS-PROT and PIR were analysed, yielding a database of 99058 domains clustered into 8877 multiple sequence alignments. The current release has 8877 entries and was indexed 19-May-2001.
    InterPro - Integrated Resource of Protein Domains and Functional Sites

    InterPro release 1.0 (March 2000) was built from Pfam 5.0, PRINTS 25.0, PROSITE 16 and the current SWISS-PROT + TrEMBL data. This release of InterPro contains 2990 entries, representing 2373 families, 556 domains, 47 repeats and 14 post-translational modification sites encoded by 4884 different regular expressions, profiles, fingerprints and HMMs.
    Further search possibilities in sequence databases:

    In addition to searches for similar (homologous) sequences or sequence motifs, one can also search sequence databases by other criteria. Some interesting approaches are listed below:

  • Hydrophilicity search

    This page calculates the hydrophilicity/hydrophobicity profile of a given protein and compares it with a library of protein hydropathic profiles (in this case the library of protein hydrophilicities is made from the SWISS-PROT database). Please enter information about the query protein, select a method of calculating the hydrophilicity, and the size of the window to calculate it over, and the number of matches you would like to see. A page will then be returned listing the best protein hydropathic profile matches. To see a plot of the two protein hydropathic profiles simply select the protein you are interested in.
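
    As an illustration of how such a profile is calculated, the sketch below averages a per-residue hydropathy value over a sliding window. It assumes the Kyte-Doolittle scale as one possible calculation method and a default window of 7 residues; the function name hydropathy_profile is hypothetical, and the way the server compares a profile against its library is not shown because the page does not describe it.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=7):
    """Return the mean hydropathy of each window along the sequence.

    Raises KeyError for residues outside the 20 standard amino acids.
    """
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# A sequence of length n yields n - window + 1 profile points.
example = hydropathy_profile("MKTLLVAGGIIRR", window=7)
print(len(example), [round(v, 2) for v in example])
```

    Larger windows smooth the profile and emphasise long hydrophobic stretches, while smaller windows preserve local detail.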

  • PROPSEARCH

    Common protein sequence alignment programs are at present unable to detect functional and/or structural homologs if the sequence identity is below the significance threshold of about 25%. PROPSEARCH was designed to find the putative protein family when a query with a new sequence has failed using alignment methods. PROPSEARCH neglects the order of amino acid residues in a sequence and uses the amino acid composition instead. In addition, other properties such as molecular weight, content of bulky residues, content of small residues, average hydrophobicity and average charge, as well as the content of selected dipeptide groups, are calculated from the sequence. In total, 144 such properties are weighted individually and used as the query vector. The weights have been trained on a set of protein families with known structures, using a genetic algorithm. Sequences in the database are transformed into vectors as well, and the Euclidean distance between the query and database sequences is calculated. Distances are rank-ordered, and the sequences with the lowest distance are reported on top (Hobohm and Sander, 1995).
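
    The core idea, comparing order-independent property vectors by weighted Euclidean distance, can be sketched as follows. This is a simplified illustration, not the PROPSEARCH implementation: it uses only the 20 amino acid frequencies instead of the 144 properties, the weights default to 1.0 rather than trained values, and all function names and sequences are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence):
    """Fraction of each of the 20 amino acids, ignoring residue order."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def weighted_distance(u, v, weights):
    """Weighted Euclidean distance between two property vectors."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(u, v, weights)))


def rank_by_distance(query, database, weights=None):
    """Return (distance, name) pairs sorted so the closest entry comes first."""
    if weights is None:
        weights = [1.0] * len(AMINO_ACIDS)  # assumption: untrained, uniform weights
    q = composition_vector(query)
    hits = [
        (weighted_distance(q, composition_vector(seq), weights), name)
        for name, seq in database.items()
    ]
    return sorted(hits)


# Toy database: one entry is a shuffled copy of the query, one is unrelated.
database = {"shuffled": "TSKGA", "different": "WWWWW"}
ranking = rank_by_distance("GAKST", database)
for distance, name in ranking:
    print(f"{name}: {distance:.3f}")
```

    Because residue order is ignored, the shuffled entry has distance zero to the query, which illustrates both the strength of the approach (no alignment is needed) and its limitation (composition alone cannot distinguish rearranged sequences).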

  • PatScan

    PatScan is a pattern matcher which searches protein or nucleotide (DNA, RNA, tRNA etc.) sequence archives for instances of a pattern which you input (Dsouza et al., 1997).

  • AACompIdent

    AACompIdent is a tool which allows the identification of a protein from its amino acid composition. It searches the SWISS-PROT and/or TrEMBL databases for proteins whose amino acid compositions are closest to the amino acid composition given. You will have to enter the following data:

    1. Amino acid composition of the protein to identify.
    2. A name for this protein, so that you can recognize it later in the results.
    3. The pI and Mw of that protein, if known, as well as error ranges that reflect the accuracy of these estimates.
    4. The species or group of species for which you would like to perform the search (example: HOMO SAPIENS or MAMMALIA). This will produce the list of proteins from this species, as well as a list of proteins independent of species. You may also just specify ALL for all SWISS-PROT / TrEMBL entries; if in doubt about the search term to use, consult the SWISS-PROT list of species.
    5. For scans in SWISS-PROT only: the keyword for which you would like to perform the search (example: ZINC-FINGER). This will produce the list of proteins matching this keyword. You may also just specify ALL for all SWISS-PROT entries; if in doubt about the exact keyword to use, consult the list of keywords used in SWISS-PROT.
    6. Amino acid composition of a known protein, obtained in the same run as the amino acid composition of the unknown protein. This is for calibration; if you do not have a calibration protein, leave this field as NULL.
    7. The SWISS-PROT identifier (ID) of the calibration protein (example: ALBU_HUMAN).
    8. Your e-mail address. The search results will be mailed back to you automatically.

  • More protein identification tools at ExPASy
    Abbreviations:

  • CODEHOP: COnsensus-DEgenerate Hybrid Oligonucleotide Primers
  • IMPALA: Integrating Matrix Profiles And Local Alignments
  • LAMA: Local Alignment of Multiple Alignments
  • MAST: Multiple Alignment and Search Tool
  • MEME: Multiple EM for Motif Elicitation
  • SMART: Simple Modular Architecture Research Tool
  • SYSTERS: SYSTEmatic Re-Searching

 

Latest update of content: September 20, 2005


Ralf Koebnik
Institut de recherche pour le développement
UMR 5096, CNRS-UP-IRD
911, Avenue Agropolis, BP 64501
34394 Montpellier, Cedex 5
FRANCE
Phone: +33 (0)4 67 41 62 28
Fax: +33 (0)4 67 41 61 81
Email: koebnik(at)gmx.de
Please replace (at) by @.

