Finding a distant similarity means defining a new similarity group of
sequences and providing a description for it in terms of a sequence pattern
or profile. This description (i.e. the pattern) has to be diagnostic, i.e.
it has to cover only the members of the new group. In fact, such a diagnostic
pattern is sometimes regarded as sufficient to prove the existence of the
group.
In principle, we can identify a group of sequences by some kind of
database search. It can be a simple database search, a profile search, or a
pattern search. The process of identifying a group can be graphically depicted
as follows:
In the unsuccessful case, the members of the similarity group cannot
be distinguished from other members of the database. In the successful case,
the new group is pulled out from underneath the pile of the other database
entries. There are four participants in this game: 1) query, 2) scoring
scheme (replacement matrix), 3) database, and 4) user. We
will now classify the search strategies based on these four factors.
As a hypothetical example, imagine that a simple database search using
the BLAST program pulled out 2 sequences that are weakly similar to the
query sequence. All other suspected homologs of the query are buried somewhere
in the search results.
Using a better query, like a pattern, or using a better scoring scheme,
or a different database, will separate the known homologs from the unrelated
sequences and may reveal new members of the similarity group.
The simplest thing one can do is to delete all unimportant parts of
the query and use only the suspected similarity region for database search.
Clearly, the unimportant parts of the query only provide "noise"
in the alignments, so it is better to get rid of them.
Second, we can replace the query with an alignment motif. Such a motif
can be derived intuitively, or we can use programs, such as PIMA or PROPAT for building
them. We will now use the new motif as a query. If it identifies a closed
group of sequences, we have found the new similarity. If we do not succeed,
we can try to build a profile, and use that as a query, until we arrive
at a situation as shown in panel B of the figure above. Naturally, one has to use
a great deal of biological knowledge in order to evaluate the group, e.g.
using the rules summarized in
Finally, there is a simple presentation strategy, based on the query:
Each alignment (obtained e.g. using BLAST) can be graphically mapped onto
the query and in this way the "conspicuous regions" of the query
can be easily picked up by inspection (graphic sorting).
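The idea of graphic sorting can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hits are given as hypothetical (start, end) pairs in 1-based query coordinates, standing in for parsed BLAST alignments, and the function names are invented for this example. The code counts how many alignments cover each query position and reports the stretches covered by at least a chosen number of hits.

```python
def query_coverage(query_length, alignments):
    """Count, for each query position, how many alignments cover it.

    alignments: iterable of (start, end) pairs in 1-based, inclusive query
    coordinates (a simplified, hypothetical stand-in for parsed BLAST hits).
    """
    coverage = [0] * query_length
    for start, end in alignments:
        # Clip to the query so odd coordinates cannot index out of range.
        lo = max(start, 1)
        hi = min(end, query_length)
        for pos in range(lo, hi + 1):
            coverage[pos - 1] += 1
    return coverage


def conspicuous_regions(coverage, min_hits):
    """Return (start, end) runs (1-based, inclusive) with coverage >= min_hits."""
    regions = []
    run_start = None
    for i, depth in enumerate(coverage, start=1):
        if depth >= min_hits and run_start is None:
            run_start = i
        elif depth < min_hits and run_start is not None:
            regions.append((run_start, i - 1))
            run_start = None
    if run_start is not None:
        regions.append((run_start, len(coverage)))
    return regions


if __name__ == "__main__":
    hits = [(2, 5), (4, 8), (4, 6)]
    cov = query_coverage(10, hits)
    # One text "bar" per query position, as a crude graphic summary.
    for position, depth in enumerate(cov, start=1):
        print(f"{position:3d} {'#' * depth}")
    print(conspicuous_regions(cov, 2))
```

The regions returned by such a procedure are candidates for the trimmed query discussed earlier; the threshold is arbitrary and has to be chosen by inspection.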
If one can construct a special replacement matrix that is specific for a given similarity group, that matrix will eventually pull out the new similarity group in a database search. Naturally, one cannot find a specific matrix for a group not yet found. However, one can find a matrix that is more sensitive to most similarity groups. As already mentioned, some amino acids like C and P are structurally important, while W and H are frequent participants in motifs. One can build a matrix that assigns a doubled match score (matrix diagonal value) to these amino acids. This matrix can be used with any search program, sometimes with good results.
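As a minimal sketch of the doubled-diagonal idea, the following Python code takes any replacement matrix stored as a dictionary and doubles the match score of C, P, W and H. The base matrix used here is a toy identity matrix (an assumption made to keep the example self-contained); in practice one would start from a real matrix such as a Dayhoff matrix. All function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def identity_matrix(match=1, mismatch=-1):
    """Toy replacement matrix (an assumption, not a real Dayhoff matrix)."""
    return {(a, b): (match if a == b else mismatch)
            for a in AMINO_ACIDS for b in AMINO_ACIDS}


def boost_diagonal(matrix, residues="CPWH", factor=2):
    """Return a copy of the matrix with the diagonal of `residues` multiplied."""
    boosted = dict(matrix)
    for r in residues:
        boosted[(r, r)] = matrix[(r, r)] * factor
    return boosted


def score_ungapped(seq_a, seq_b, matrix):
    """Score two equal-length segments position by position (no gaps)."""
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))


if __name__ == "__main__":
    base = identity_matrix()
    boosted = boost_diagonal(base)
    print(score_ungapped("CAW", "CGW", base))
    print(score_ungapped("CAW", "CGW", boosted))
```

With the boosted matrix, segments sharing C, P, W or H score higher relative to segments sharing other residues, which is the intended shift in sensitivity.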
We can also build statistical replacement matrices based on the multiple alignments of known modules (domains). Both of these strategies are highly empirical, and there is no guarantee of success. They can provide a good starting point, however: the sequences that are fished out by these methods can be further used to build motifs and profiles, and the searches can be statistically evaluated.
Besides the matrix, a careful choice of the gap penalty can also improve
the sensitivity. The choice of a formula for the gap penalty is
usually quite arbitrary. Some simple strategies, such as length-dependent gap
penalties, improve sensitivity by themselves. One can also build a gap penalty
function that depends on the predicted secondary structure of the query. For
example, it is known that α-helices do not usually tolerate
insertions and deletions, so one can set a higher gap penalty for these
regions.
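A structure-dependent, length-dependent gap penalty can be sketched as follows. The affine form (opening cost plus a per-residue extension cost), the numerical values and the use of 'H' to mark predicted helix positions are all assumptions chosen for illustration; they are not taken from any particular program.

```python
def gap_open_penalties(secondary_structure, base_penalty=10.0, helix_factor=2.0):
    """Per-position gap-opening penalties for the query.

    secondary_structure: one character per query position; 'H' marks a
    predicted helix (hypothetical encoding), anything else is treated as
    non-helical.
    """
    return [base_penalty * helix_factor if state == "H" else base_penalty
            for state in secondary_structure]


def gap_cost(position, length, open_penalties, extend_penalty=1.0):
    """Affine cost of a gap of `length` residues opened at 0-based `position`."""
    if length <= 0:
        return 0.0
    return open_penalties[position] + extend_penalty * (length - 1)


if __name__ == "__main__":
    penalties = gap_open_penalties("CCHHHC")
    print(penalties)
    print(gap_cost(2, 3, penalties))  # gap inside the predicted helix
    print(gap_cost(0, 3, penalties))  # same gap outside the helix
```

An alignment routine that looks up the opening cost per position, instead of using one global constant, is then discouraged from placing gaps inside predicted helices.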
We can also clean up the database and create a streamlined database for distant similarity searching. "Cleaning up" means discarding some of the sequences and concentrating on a given part of the database. One method is to discard the repetitive (low-complexity) part of the database. This is a simple, automated procedure that leads to a database in which the repetitive sequences are replaced by runs of X. A second strategy is to retain only those parts of the database for which the structure or function is known. This is the domain-library approach employed by SBASE. Finally, one can retain only those sequence regions that are known to be similar to other sequences. This is the approach of PRODOM. In all of these cases we can use simple search programs, but we can also use profile searches and other advanced search strategies.
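The masking of repetitive regions can be illustrated with a deliberately simplified sketch. It slides a window along the sequence, computes the Shannon entropy of the residue composition in each window, and replaces low-entropy windows with X. The window length and threshold are arbitrary assumptions, and this is not the algorithm of any specific masking program; it only shows the principle.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(sequence, window=12, threshold=2.2):
    """Replace every residue that lies in a low-entropy window with 'X'."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


if __name__ == "__main__":
    seq = "ACDEFGHIKLMN" + "Q" * 12 + "ACDEFGHIKLMN"
    print(mask_low_complexity(seq))
```

Windows that straddle the boundary of a repeat may also fall below the threshold, so a few flanking residues can be masked as well; tuning the two parameters trades sensitivity against over-masking.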
The last, and perhaps most important, way of cleaning up a database
is to transform it into a database of patterns. To do this, we have to
identify the homology groups and generate pattern descriptions for them.
This process is laborious and lags behind the growing databases. Furthermore,
the descriptions are inevitably biased. Nevertheless, the pattern collections
(like PROSITE or PRINTS) are easy to use and are the first things to try
whenever a new sequence is found.
Property patterns - Flexible patterns - Classical profile method - Improved profile methods - Automated pattern generation - Automated iterative motif search - Other recent methods
Many interesting patterns are not detected simply because the user overlooks
them. During pattern hunting, the user has to evaluate a large number of
alignments, and some weak patterns may be buried among random hits. Sometimes
patterns coincide with the same or similar function in various proteins,
but the user has no time or patience to look up the annotation part of
the sequence entries in which the functional information is found. All
these problems lead to the user missing interesting patterns that are otherwise
present in a database search output. There are a number of programs, sometimes
called output processors, that help the user to process search
outputs.
In general these are simple programs that require very little additional
computer time.
PATCO is a program that collects recurrent patterns (in fact, consensus
sequences) from FASTA outputs. It is installed on ICGEBnet as a part of the
ICGEBprot program. It is a simple algorithm which is nevertheless able
to produce quite satisfactory patterns that can be used as a starting point
for pattern building with more sophisticated methods.
FTHOM is a program that examines whether BLAST alignments coincide with functionally
assigned domains in Swiss-Prot. The program compares the alignment endpoints
with the feature tables and assigns the alignment scores to feature names according
to the overlap between a feature and the alignment.
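The comparison of alignment endpoints with a feature table can be sketched as below. This is not the FTHOM code: the data layout, the function names and the weighting rule (alignment score multiplied by the fraction of the feature that is covered) are assumptions made for illustration, and the example considers a single database entry only.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Number of positions shared by two inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def score_features(alignments, features):
    """Distribute alignment scores over the features they overlap.

    alignments: list of (subject_start, subject_end, score) for one entry.
    features:   list of (name, start, end) from that entry's feature table.
    Each feature receives score * (fraction of the feature covered); this
    weighting is an assumption, not a documented rule.
    """
    totals = {}
    for a_start, a_end, score in alignments:
        for name, f_start, f_end in features:
            shared = overlap_length(a_start, a_end, f_start, f_end)
            if shared:
                fraction = shared / (f_end - f_start + 1)
                totals[name] = totals.get(name, 0.0) + score * fraction
    return totals


if __name__ == "__main__":
    feature_table = [("DOMAIN kinase", 10, 19), ("ZN_FING", 50, 59)]
    hits = [(15, 30, 100.0), (10, 19, 40.0)]
    print(score_features(hits, feature_table))
```

Sorting the resulting dictionary by value gives a ranked list of feature names, which is the kind of summary a user would otherwise have to assemble by reading the annotations by hand.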
Erik Sonnhammer's BLAST output processing tool is an excellent example of an interactive output processor. It contains several interesting features, for example a graphic representation (graphic sorting) of the alignments along a query. As BLAST produces a large number of short alignments, this is a very useful graphic summary.
Most "advanced strategies" use a combination of the above
approaches. For example, the SBASE search server uses a domain library
as the database, and the output is graphically sorted (projected onto the
query). The BCM search launcher also uses graphic sorting, and uses a variety
of automatically generated pattern databases. Several of these methods
were recently reviewed by Bork and Gibson [4].
The program PROPAT, developed by Bork and colleagues [5],
has been applied many times in the detection of distant homologies. This
method is able to generalise a pattern, even from a rather small learning
set, by automatically deriving distinct combinations of physicochemical
properties for each position; a vector of such properties is assigned to
each amino acid. It can be used for a single motif, combinations of motifs,
or for whole domains, and is already a step towards profile searching,
since a vector of weights (in this case penalties) is assigned to each
position of the alignment (including gaps). PROPAT can search 6-frame
translations of DNA databases.
The flexible patterns of Barton and Sternberg [6]
combine features of motifs and profiles. The patterns can be set up in
various ways but are essentially permutations of conserved blocks, separated
by gaps of specified ranges, and are compared to sequences using a dynamic
programming approach. The Barton approach has been applied, for example,
in a recent survey of the DHR domain distribution [7].
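The structure of such a pattern, conserved blocks separated by gaps of specified ranges, can be illustrated with a regular expression. Note the substitution: the method described above compares patterns to sequences by dynamic programming and tolerates imperfect matches, whereas the sketch below uses exact regular-expression matching from Python's standard `re` module. The block strings and the function name are hypothetical.

```python
import re


def flexible_pattern(blocks, gaps):
    """Compile blocks separated by variable-length gaps into one regex.

    blocks: list of regex fragments, one per conserved block.
    gaps:   list of (min_len, max_len) pairs, one between each pair of
            consecutive blocks.
    """
    if len(gaps) != len(blocks) - 1:
        raise ValueError("need exactly one gap range between consecutive blocks")
    parts = [blocks[0]]
    for (lo, hi), block in zip(gaps, blocks[1:]):
        parts.append(".{%d,%d}" % (lo, hi))  # any residues, lo to hi of them
        parts.append(block)
    return re.compile("".join(parts))


if __name__ == "__main__":
    pattern = flexible_pattern(["C..C", "H[ILV]H"], [(2, 4)])
    match = pattern.search("AACAACGGGHIHAA")
    print(match.group() if match else None)
```

Exact matching of this kind is fast but brittle; a single substitution inside a block makes the match fail, which is precisely what the dynamic programming comparison avoids.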
Profile analysis as implemented by Gribskov et al. [8]
performs exhaustive alignment by dynamic programming of a family-based
scoring matrix against test sequences. The profile comprises two
components for each position in the alignment: scores for the 20 amino
acids and variable gap opening and extension penalties. The amino acid
substitution scores are created by summing Dayhoff exchange matrix values
according to the observed amino acids in each column of the alignment.
Gap penalties are reduced at positions with gaps, according to the length
of the longest insertion spanning that point in the alignment. The programs
PROFILEMAKE, PROFILESEARCH and PROFILEGAP are widely available through
the GCG sequence analysis package [9],
making them the most frequently used programs in the field of motif and
profile searches. However, the GCG PROFILESEARCH (including version 8.0)
cannot handle current database sizes and fails to warn the
user clearly that the search is incomplete. The TPROFILESEARCH version
(available from Peter Rice, EBI, Hinxton, UK) corrects this problem.
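The residue-score part of a profile can be sketched as follows. For each alignment column, the score of every amino acid is obtained by combining exchange-matrix values over the residues observed in that column. Several simplifications are assumed here: a toy identity matrix replaces the Dayhoff matrix, the column sum is divided by the number of observed residues (an average rather than a raw sum), all sequences are weighted equally, and the position-specific gap penalties are omitted. The function names are hypothetical and this is not the GCG implementation.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_matrix(match=2, mismatch=-1):
    """Toy exchange matrix (an assumption standing in for a Dayhoff matrix)."""
    return {(a, b): (match if a == b else mismatch)
            for a in AMINO_ACIDS for b in AMINO_ACIDS}


def build_profile(alignment, matrix):
    """Return one {residue: score} dictionary per alignment column."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        observed = [row[col] for row in alignment if row[col] != "-"]
        column = {}
        for residue in AMINO_ACIDS:
            total = sum(matrix[(residue, seen)] for seen in observed)
            column[residue] = total / len(observed) if observed else 0.0
        profile.append(column)
    return profile


def score_ungapped(profile, segment):
    """Score a segment of the same length as the profile, without gaps."""
    return sum(column[res] for column, res in zip(profile, segment))


if __name__ == "__main__":
    profile = build_profile(["ACD", "ACE", "A-E"], toy_matrix())
    print(score_ungapped(profile, "ACE"))
    print(score_ungapped(profile, "ACD"))
```

A full profile search would align this position-specific table to each database sequence by dynamic programming, using the per-position gap penalties that are left out of the sketch.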
Starting from a set of sequences in the GCG format, you can build a
PROFILE using the GCG PROFILE program. You can submit a GCG PROFILE for
searching databases on the BIOCCELERATOR (Israel) at
Searches of a PROFILE database corresponding to the PROSITE motifs are
available at ISREC (Switzerland):
A number of modifications have been suggested for improving the creation
of profiles that increase the sensitivity of the method. Several of these
improvements have been incorporated into programs such as PROFILEWEIGHT [10]
and the method of Luthy et al. [11].
For example, an alignment often consists of many closely related sequences together with a few rather divergent ones. The closely related sequences in the multiple alignment (learning set) offer little additional information, yet bias the profile residue scores. Sequence weighting schemes which upweight divergent sequences while downweighting closely related groupings have thus been found to improve profile sensitivity.
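One simple way to obtain such weights is the position-based scheme usually attributed to Henikoff and Henikoff, sketched below. It is named here as an illustrative choice and is not necessarily the scheme used by the programs mentioned above. In each column a sequence receives 1/(k * n), where k is the number of different residue types in the column and n is the number of sequences sharing its residue; the contributions are summed over columns and normalised. Skipping gap positions is an additional assumption.

```python
from collections import Counter


def position_based_weights(alignment):
    """Return one weight per aligned sequence; the weights sum to 1."""
    weights = [0.0] * len(alignment)
    for col in range(len(alignment[0])):
        column = [row[col] for row in alignment]
        counts = Counter(res for res in column if res != "-")
        k = len(counts)  # number of distinct residue types in this column
        for i, res in enumerate(column):
            if res != "-":
                weights[i] += 1.0 / (k * counts[res])
    total = sum(weights)
    if total == 0:
        # Degenerate case (only gaps): fall back to equal weights.
        return [1.0 / len(alignment)] * len(alignment)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Two identical sequences and one divergent sequence.
    print(position_based_weights(["AAAA", "AAAA", "CCCC"]))
```

In the example, the two identical sequences share the weight that a single copy would receive, so the divergent sequence contributes as much as both of them together.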
Noise is also reduced in database searches by gap excision [30], since long
insertions are sites of breakdown in homology within the family and typically
lack meaningful conservation. Release 2 of PROFILEWEIGHT will also bring
in new gap penalty reductions based on average gap length, rather than
the single longest sequence, to better match observed gap properties in
alignments. Both TPROFILESEARCH (P. Rice, EBI, Hinxton) and the PairWise/SearchWise
package (E. Birney, J. Thompson and T. Gibson) are able to perform protein
profile alignments to 6-frame translations of DNA sequences. The latter
programs use an extension to dynamic programming to compare the profile
simultaneously to the three translation frames of a DNA strand, allowing
frame-jumping [12].
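The 6-frame translation that these programs rely on can be sketched as follows, using the standard genetic code. The sketch only produces the six conceptual translations; it does not attempt the simultaneous comparison of frames or the frame-jumping described above. The function names and the frame labels (+1 to +3, -1 to -3) are choices made for this example.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for all three codon positions.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return a dict mapping frame labels to translated protein strings."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[strand + str(offset + 1)] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCCTAA").items():
        print(label, protein)
```

Stop codons are written as `*`, so open reading frames appear as the stretches between asterisks in each of the six strings.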
Automated pattern generation based on the PIMA program
Temple Smith and Randy Smith at Harvard University have developed a method that can automatically generate a diagnostic sequence pattern from a collection of homologous protein sequences, and have used this method to construct diagnostic patterns for all protein families in the SWISS-PROT protein sequence database. Using BLASTP, a new high-speed similarity search tool of Stephen Altschul, all sequences in the SWISS-PROT protein sequence database were compared pairwise. Similar sequences (i.e. those with BLASTP scores > 65.0) were then clustered using a maximal-linkage joining rule. For SWISS-PROT release 13, this generated 2026 clusters of 2 sequences or more, encompassing 10665 of the 13837 sequences in the database.
"Consensus-like" AACC (amino acid class covering) patterns were constructed from each of the sets of clustered sequences using the following steps. Starting with the most similar pair of sequences in the cluster, an optimal local alignment is performed using an extended dynamic programming algorithm. A consensus-like sequence is then constructed by 1) associating aligned residues with the least inclusive amino acid class containing the aligned residues using a simplified amino acid classification hierarchy (e.g. D/D -> D; D/E -> [DE]; D/V -> X) and 2) filling gaps with "gap characters" (each representing 0 or 1 residue of any type), retaining the longest gap observed at any one site. To accommodate AACC sequences in the dynamic programming algorithm, the scoring matrix was extended to include single-residue vs. amino acid class as well as class vs. class comparisons. To allow the proper alignment of AACC sequences containing gap characters, 1) gap characters are given a match score of 0 when aligned with any character and 2) gap penalties are reduced for the introduction of a new gap within regions containing gap characters.
After construction of the AACC sequence, the two aligned input sequences are replaced by the single new AACC sequence at the node joining the input sequences in the cluster tree hierarchy. This process is then repeated for the next most similar pair of sequences (either actual or AACC sequences) and continued until the "root" AACC sequence for the entire tree has been generated.
The current release of the pattern database (rel. 4.0) includes 2026 patterns derived from all sequence families containing 2 members or more (encompassing 10665 of the 13837 sequences in SWISS-PROT rel. 13) plus all of the remaining 3173 "non-related" sequences, represented as single-member "patterns". The AACC patterns consist of upper-case characters (standard IUPAC one-letter code) representing absolutely conserved residues, lower-case characters representing pre-defined amino acid classes (shown below), wild-card characters (X), and gap characters (g) representing 0 or 1 residues of any type.
The nomenclature of the groups is as follows:
-2 "x"
________________ X __________________ 0 " "
/ / \ \
__ f __ / ______r _______ \ 1 "."
/ / \ / / / \ \ \
/ c \ e / m p \ _ j __ 2 ":"
/ / \ \ / \ / / \ / \ \ / \ \
/ a b d \ / l k o n i h \ 3 "!"
/ / \ / \ /|\ \ / / \ / \ / \ /\ / \ / \ \
C I V L M F W Y H N D E Q K R S T A G P 5 "|"
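The covering step for a single pair of already aligned sequences can be sketched as follows. The class list in the code is a small, hypothetical hierarchy chosen only to reproduce the examples given in the text (D/D -> D, D/E -> [DE], D/V -> X); it is not the hierarchy shown above, and the bracket notation is used instead of the lower-case class letters. The merging of AACC sequences with each other (class vs. class) is not covered.

```python
# Hypothetical, simplified amino acid classes (NOT the hierarchy shown above).
CLASSES = [set("DE"), set("KR"), set("NQ"), set("ST"),
           set("IV"), set("LM"), set("FWY"), set("ILMV")]


def covering_class(a, b):
    """Least inclusive class containing both residues, or 'X' if none does."""
    if a == b:
        return a
    for members in sorted(CLASSES, key=len):  # smallest classes first
        if a in members and b in members:
            return "[" + "".join(sorted(members)) + "]"
    return "X"


def aacc_consensus(aligned_a, aligned_b):
    """Consensus-like pattern of two aligned sequences ('-' marks a gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pattern = []
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            pattern.append("g")  # gap character: 0 or 1 residue of any type
        else:
            pattern.append(covering_class(a, b))
    return "".join(pattern)


if __name__ == "__main__":
    print(aacc_consensus("DDV-K", "DEDAK"))
```

Applied repeatedly up the cluster tree, with the scoring matrix extended to classes as described above, this step yields the root pattern for a family.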
Automated versions of this pattern building and pattern search procedure
are now available on the BCM server
http://dot.imgen.bcm.tmc.edu:9331/seq-search/protein-search.html
Automated iterative motif search
A recent motif search method (MoST, for motif search tool) [13]
follows the BLAST strategy (in which gaps are not treated) in order to
be able to handle the resulting alignment blocks in a proper mathematical
sense. These blocks are converted into a position-dependent weighting
matrix following a log-odds, Bayesian-based approach and incorporating
prior residue probabilities calculated from a mixture of Dirichlet distributions.
MoST combines an extremely fast block search with good sensitivity. In
addition, automatic iterations have been incorporated, i.e. database sequences
scoring above a user-defined threshold are incorporated in the block
alignment and the weighting for the next iteration is adapted to the new
alignment. To allow for the different behaviours of protein families, manual
intervention is possible at several levels. The drawback of excluding gaps
will be circumvented in an improved version that handles the statistics
of several blocks (E.V. Koonin, pers. commun.). Thus, this method is highly
recommended, in particular, if functionally conserved residues are surrounded
by semiconserved but structurally important positions, as can be observed
in distantly related enzymes.
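The conversion of an ungapped block into a position-dependent log-odds matrix, followed by one search iteration, can be sketched as below. This is not the MoST implementation. In particular, the mixture of Dirichlet priors is replaced by a single flat pseudocount spread according to a uniform background (both assumptions), no statistics are computed, and all function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
UNIFORM = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def log_odds_matrix(block, background=UNIFORM, pseudocount=1.0):
    """One {residue: log2(frequency / background)} dict per block column."""
    n = len(block)
    matrix = []
    for col in range(len(block[0])):
        counts = {aa: 0 for aa in AMINO_ACIDS}
        for segment in block:
            counts[segment[col]] += 1
        column = {}
        for aa in AMINO_ACIDS:
            # Simple pseudocount in place of a Dirichlet mixture prior.
            freq = (counts[aa] + pseudocount * background[aa]) / (n + pseudocount)
            column[aa] = math.log2(freq / background[aa])
        matrix.append(column)
    return matrix


def best_ungapped_hit(matrix, sequence):
    """Return (score, start) of the best window, or None if too short."""
    width = len(matrix)
    best = None
    for start in range(len(sequence) - width + 1):
        score = sum(matrix[i].get(sequence[start + i], 0.0) for i in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best


def iterate_once(block, database, threshold):
    """Add the best segment of every sequence scoring at least `threshold`."""
    matrix = log_odds_matrix(block)
    width = len(matrix)
    new_block = list(block)
    for sequence in database:
        hit = best_ungapped_hit(matrix, sequence)
        if hit is not None and hit[0] >= threshold:
            segment = sequence[hit[1]:hit[1] + width]
            if segment not in new_block:
                new_block.append(segment)
    return new_block


if __name__ == "__main__":
    block = ["HGD", "HGE", "HGD", "HAD"]
    print(iterate_once(block, ["WWHGNWW", "KKKKKKK"], threshold=4.0))
```

Repeating `iterate_once` until the block stops growing mimics the automatic iterations; the threshold plays the role of the user-defined cut-off, and a poorly chosen value will let unrelated segments drift into the block.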