III. STRATEGIES FOR FINDING DISTANT SIMILARITIES

Query-based strategies

Score-based strategies

Database-driven strategies

Output-processing tools

Finding a distant similarity means defining a new similarity group of sequences and providing a description for it in terms of a sequence pattern or profile. This description (i.e. the pattern) has to be diagnostic, i.e. it has to cover only the members of the new group. In fact, such a diagnostic pattern is sometimes regarded as sufficient to prove the existence of the group.

In principle, we can identify a group of sequences by some kind of database search: a simple database search, a profile search, or a pattern search. The process of identifying a group can be graphically depicted as follows:

In the unsuccessful case, the members of the similarity group cannot be distinguished from other members of the database. In the successful case, the new group is pulled out from underneath the pile of the other database entries. There are four participants in this game: 1) query, 2) scoring scheme (replacement matrix), 3) database, and 4) user. We will now classify the search strategies based on these four factors.

As a hypothetical example, imagine that a simple database search using the BLAST program pulled out 2 sequences that are weakly similar to the query sequence. All other suspected homologs of the query are buried somewhere in the search results. Using a better query (such as a pattern), a better scoring scheme, or a different database may separate the known homologs from the unrelated sequences and may reveal new members of the similarity group.

Query-based strategies

The simplest thing one can do is to delete all unimportant parts of the query and use only the suspected similarity region for database search. Clearly, the unimportant parts of the query only provide "noise" in the alignments, so it is better to get rid of them.

Second, we can replace the query with an alignment motif. Such a motif can be derived intuitively, or it can be built with programs such as PIMA or PROPAT. We then use the new motif as a query. If it identifies a closed group of sequences, we have found the new similarity. If we do not succeed, we can try to build a profile and use that as a query, until we arrive at a situation as shown in the above figure B. Naturally, one has to use a great deal of biological knowledge in order to evaluate the group, e.g. using the rules summarized in

Finally, there is a simple presentation strategy, based on the query: Each alignment (obtained e.g. using BLAST) can be graphically mapped onto the query and in this way the "conspicuous regions" of the query can be easily picked up by inspection (graphic sorting).
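Graphic sorting can be illustrated with a minimal sketch. It assumes that each alignment has already been reduced to a pair of 1-based, inclusive query coordinates; the function names coverage_profile and draw_hits are hypothetical and are not taken from any of the programs mentioned here.

```python
def coverage_profile(query_length, hits):
    """Count how many alignments cover each query position.

    hits: list of (query_start, query_end), 1-based and inclusive (assumed format).
    """
    depth = [0] * query_length
    for start, end in hits:
        for pos in range(max(start, 1), min(end, query_length) + 1):
            depth[pos - 1] += 1
    return depth


def draw_hits(query_length, hits):
    """Draw the query as '=' and each alignment as a '-' bar under its region."""
    lines = ["=" * query_length]
    for start, end in sorted(hits):
        lines.append(" " * (start - 1) + "-" * (end - start + 1))
    return "\n".join(lines)


if __name__ == "__main__":
    example_hits = [(2, 5), (4, 8), (3, 6)]
    print(draw_hits(10, example_hits))
    print(coverage_profile(10, example_hits))
```

Positions with a high coverage count are the "conspicuous regions" that are worth a closer look.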

Score-based strategies

If one can construct a special replacement matrix that is specific for a given similarity group, that matrix will eventually pull out the new similarity group in a database search. Naturally, one cannot find a specific matrix for a group not yet found. However, one can find a matrix that is more sensitive to most similarity groups. As already mentioned, some amino acids like C and P are structurally important, while W and H are frequent participants in motifs. One can build a matrix that assigns a doubled score (matrix diagonal value) to these amino acids. This matrix can be used with any search program, sometimes with good results.
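The doubled-diagonal idea can be sketched as follows. The matrix is assumed to be stored as a dictionary keyed by residue pairs, and the toy values are made up for illustration; they are not taken from a real Dayhoff or other published table. The name boost_diagonal is hypothetical.

```python
def boost_diagonal(matrix, residues="CPWH", factor=2):
    """Return a copy of a replacement matrix in which the self-match
    (diagonal) score of the chosen residues is multiplied by `factor`."""
    boosted = dict(matrix)
    for aa in residues:
        if (aa, aa) in boosted:
            boosted[(aa, aa)] = boosted[(aa, aa)] * factor
    return boosted


# Toy matrix fragment with invented values (assumption, not a real matrix).
toy = {("C", "C"): 9, ("A", "A"): 4, ("C", "A"): 0, ("A", "C"): 0, ("W", "W"): 11}
boosted = boost_diagonal(toy)
print(boosted[("C", "C")], boosted[("A", "A")], boosted[("W", "W")])
```

Off-diagonal values are left untouched, so the modified matrix can be passed to any search program that accepts a user-supplied matrix.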

We can also build statistical replacement matrices based on the multiple alignments of known modules (domains). Both of these strategies are highly empirical, and there is no guarantee of success. They can provide a good starting point, however: the sequences that are fished out by these methods can be used to build motifs and profiles, and the searches can be statistically evaluated.

Besides the matrix, a careful choice of the gap penalty can also improve the sensitivity. The choice of a formula for the gap penalty is usually quite arbitrary. Some simple strategies, such as length-dependent gap penalties, improve sensitivity by themselves. One can also build a gap penalty function that depends on the predicted secondary structure of the query. For example, it is known that α-helices do not usually accept insertions and deletions, so one can assign a higher gap penalty to these regions.
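A position-specific gap penalty of this kind might look like the sketch below. It assumes the predicted secondary structure is given as a string with one letter per query residue (H for helix, anything else treated as non-helical); the base penalty, the helix factor and both function names are illustrative assumptions.

```python
def gap_open_penalties(secondary_structure, base=10.0, helix_factor=2.0):
    """Return one gap-opening penalty per query position.

    Positions predicted as helix ('H') get a higher penalty because helices
    rarely tolerate insertions and deletions.
    """
    return [base * (helix_factor if state == "H" else 1.0)
            for state in secondary_structure]


def length_dependent_gap(length, open_penalty, extend_penalty=1.0):
    """Length-dependent (affine) gap cost: opening cost plus a smaller
    cost for every additional gapped residue."""
    return open_penalty + extend_penalty * (length - 1)


if __name__ == "__main__":
    penalties = gap_open_penalties("CCHHHHCC")
    print(penalties)
    print(length_dependent_gap(3, penalties[2]))
```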

Database-driven strategies

We can also clean up the database and create a streamlined database for distant similarity searching. "Cleaning up" means discarding some of the sequences and concentrating on a given part of the database. One method is to discard the repetitive (low-complexity) part of the database. This is a simple, automated procedure which leads to a database in which the repetitive sequences are replaced with X characters. A second strategy is to retain only those parts of the database for which the structure or function is known. This is the domain-library approach employed by SBASE. Finally, one can retain only those sequence regions that are known to be similar to other sequences. This is the approach of PRODOM. In all of these cases we can use simple search programs, but we can also use profile searches and other advanced search strategies.
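The masking step can be illustrated with a minimal sketch. It assumes a sliding-window Shannon entropy test, which is only a stand-in for whatever complexity measure a real filtering program uses; the window length, the threshold and the function names are illustrative choices, not values from the text.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits per residue) of the residue composition."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.2):
    """Replace every residue that lies in a low-entropy window with 'X'."""
    masked = list(seq)
    for start in range(0, max(len(seq) - window + 1, 0)):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


if __name__ == "__main__":
    print(mask_low_complexity("ACDEFGHIKLMN" + "S" * 12))
```

Because whole windows are masked, a few residues flanking a repetitive stretch may be masked as well; a stricter threshold reduces this effect.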

The last, and perhaps most important, way of cleaning up a database is to transform it into a database of patterns. To do this, we have to identify the homology groups and generate pattern descriptions for them. This process is laborious and lags behind the growing databases. Furthermore, the descriptions are inevitably biased. Nevertheless, the pattern collections (like PROSITE or PRINTS) are easy to use and are the first things to try whenever a new sequence is found.

Output-processing tools

Property patterns - Flexible patterns - Classical profile method - Improved profile methods - Automated pattern generation - Automated iterative motif search - Other recent methods

Many interesting patterns are not detected simply because the user overlooks them. During pattern hunting, the user has to evaluate a large number of alignments, and some weak patterns may be buried among random hits. Sometimes patterns coincide with the same or similar function in various proteins, but the user has no time or patience to look up the annotation part of the sequence entries in which the functional information is found. All these problems lead to the user missing interesting patterns that are otherwise present in a database search output. There are a number of programs, sometimes called output processors, that help the user to process search outputs. In general these are simple programs that require very little additional computer time.

PATCO is a program that collects recurrent patterns (in fact, consensus sequences) from FASTA outputs. It is installed on ICGEBnet as a part of the ICGEBprot program. It is a simple algorithm which is nevertheless able to produce quite satisfactory patterns that can serve as a starting point for pattern building with more sophisticated methods.

FTHOM is a program that examines whether BLAST alignments coincide with functionally assigned domains in Swiss-Prot. The program compares the alignment endpoints with the feature tables and assigns scores to feature names according to the overlap between a feature and the alignment.³ FTHOM is part of the domain server (domain@hubi.abc.hu).
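The overlap test can be sketched as follows. This is not the FTHOM implementation: the exact scoring rule is not given here, so the sketch assumes that each alignment score is credited to a feature in proportion to the fraction of the feature that the alignment covers. The tuple layouts and the name score_features are hypothetical.

```python
def score_features(alignments, features):
    """Credit alignment scores to annotated features by overlap.

    alignments: list of (start, end, score) on the database sequence.
    features:   list of (name, start, end) from the feature table.
    Coordinates are 1-based and inclusive (assumed convention).
    """
    totals = {}
    for a_start, a_end, a_score in alignments:
        for name, f_start, f_end in features:
            overlap = min(a_end, f_end) - max(a_start, f_start) + 1
            if overlap > 0:
                fraction = overlap / (f_end - f_start + 1)
                totals[name] = totals.get(name, 0.0) + a_score * fraction
    return totals


if __name__ == "__main__":
    feature_table = [("DOMAIN kinase", 1, 100), ("SIGNAL", 1, 9)]
    print(score_features([(10, 59, 100.0)], feature_table))
```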

Erik Sonnhammer's BLAST output processing tool is an excellent example of an interactive output processor. It contains several interesting features, for example a graphic representation (graphic sorting) of the alignments along a query. As BLAST produces a large number of short alignments, this is a very useful graphic summary.

Most "advanced strategies" use a combination of the above approaches. For example, the SBASE search server uses a domain library as the database, and the output is graphically sorted (projected onto the query). The BCM search launcher also uses graphic sorting, and uses a variety of automatically generated pattern databases. Several recent methods were reviewed by Bork and Gibson.⁴

Property patterns

The program PROPAT, developed by Bork and colleagues,⁵ has been applied many times in the detection of distant homologies. This method is able to generalise a pattern, even from a rather small learning set, by automatically deriving distinct combinations of physicochemical properties for each position; a vector of such properties is assigned to each amino acid. It can be used for a single motif, combinations of motifs, or for whole domains, and is already a step towards profile searching, since a vector of weights (in this case penalties) is assigned to each position of the alignment (including gaps). PROPAT can search 6-frame translations of DNA databases.
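The idea of a property pattern can be illustrated with a small sketch. It is not the PROPAT algorithm: the property table below is an illustrative assumption, and the penalty is simply the number of required properties that a residue lacks. All names are hypothetical.

```python
# Illustrative property assignments (assumption, not the PROPAT table).
PROPERTIES = {
    "hydrophobic": set("AVLIMFWC"),
    "aromatic": set("FWY"),
    "charged": set("DEKR"),
    "positive": set("KR"),
    "negative": set("DE"),
    "small": set("AGSTC"),
}


def property_vector(aa):
    """Set of property names that apply to one amino acid."""
    return frozenset(name for name, members in PROPERTIES.items() if aa in members)


def property_pattern(alignment):
    """For each alignment column, keep the properties shared by all residues."""
    pattern = []
    for column in zip(*alignment):
        shared = set(PROPERTIES)
        for aa in column:
            shared &= property_vector(aa)
        pattern.append(shared)
    return pattern


def penalty(sequence, pattern):
    """Sum, over positions, of the required properties the residue lacks."""
    return sum(len(required - property_vector(aa))
               for aa, required in zip(sequence, pattern))


if __name__ == "__main__":
    learned = property_pattern(["KDF", "REW"])
    print(learned)
    print(penalty("KEY", learned))
```

Even with a learning set of two sequences, the pattern accepts residues that were never observed (here E at the second position) as long as they share the required properties.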

Flexible patterns

The flexible patterns of Barton and Sternberg ⁶ combine features of motifs and profiles. The patterns can be set up in various ways but are essentially permutations of conserved blocks, separated by gaps of specified ranges, and are compared to sequences using a dynamic programming approach. The Barton approach has been applied, for example, in a recent survey of the DHR domain distribution.⁷
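The notion of conserved blocks separated by gaps of specified ranges can be illustrated with a regular expression. This is a deliberately simplified stand-in: a regular expression only reports exact matches, whereas the dynamic programming approach described above scores approximate matches. The function name and the example blocks are hypothetical.

```python
import re


def flexible_pattern(blocks, gap_ranges):
    """Compile blocks separated by gaps of bounded length into a regex.

    blocks:     list of regex fragments, one per conserved block.
    gap_ranges: list of (min_len, max_len), one per gap between blocks.
    """
    if len(gap_ranges) != len(blocks) - 1:
        raise ValueError("need exactly one gap range between consecutive blocks")
    parts = [blocks[0]]
    for (lo, hi), block in zip(gap_ranges, blocks[1:]):
        parts.append(".{%d,%d}" % (lo, hi))
        parts.append(block)
    return re.compile("".join(parts))


if __name__ == "__main__":
    pattern = flexible_pattern(["GDSG", "H[ED]"], [(2, 4)])
    match = pattern.search("AAGDSGTTTHEAA")
    print(match.start(), match.group())
```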

The classical profile method

Profile analysis as implemented by Gribskov et al.⁸ performs exhaustive alignment by dynamic programming of a family-based scoring matrix against test sequences. The profile comprises two components for each position in the alignment: scores for the 20 amino acids, and variable gap opening and extension penalties. The amino acid substitution scores are created by summing Dayhoff exchange matrix values according to the observed amino acids in each column of the alignment. Gap penalties are reduced at positions with gaps, according to the length of the longest insertion spanning that point in the alignment. The programs PROFILEMAKE, PROFILESEARCH and PROFILEGAP are widely available through the GCG sequence analysis package,⁹ making them the most frequently used programs in the field of motif and profile searches. However, the GCG PROFILESEARCH (including version 8.0) does not handle current database sizes and fails to warn the user clearly that the search is incomplete. The TPROFILESEARCH version (available from Peter Rice, EBI, Hinxton, UK) corrects this problem.
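The column-summing step can be sketched as follows. The sketch assumes an unweighted sum over the residues observed in a column and uses a toy three-letter alphabet with invented exchange values, not the Dayhoff matrix; gap penalties are left out. The function names are hypothetical.

```python
def profile_column(column, matrix, alphabet):
    """Score every residue of the alphabet at one profile position by
    summing the exchange values against the residues observed in the column."""
    observed = [aa for aa in column if aa != "-"]
    return {res: sum(matrix[obs][res] for obs in observed) for res in alphabet}


def build_profile(alignment, matrix, alphabet):
    """Build one score dictionary per alignment column."""
    return [profile_column(col, matrix, alphabet) for col in zip(*alignment)]


# Toy exchange matrix with invented values (assumption for illustration).
TOY_MATRIX = {
    "A": {"A": 2, "D": 0, "E": 0},
    "D": {"A": 0, "D": 4, "E": 3},
    "E": {"A": 0, "D": 3, "E": 4},
}

if __name__ == "__main__":
    for position in build_profile(["AD", "AE", "-D"], TOY_MATRIX, "ADE"):
        print(position)
```

In the second column the profile gives E a high score although D is the more common residue, because D and E exchange readily in the toy matrix.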

Starting from a set of sequences in the GCG format, you can build a PROFILE using the GCG PROFILE program. You can submit a GCG PROFILE for searching databases on the BIOCCELERATOR (Israel) at

http://sgbcd.weizmann.ac.il/Bic/ExecAppl.html/

Searches of a PROFILE database corresponding to the PROSITE motifs are available at ISREC (Switzerland):

http://ulrec3.unil.ch/software/PFSCAN_form.html

Improved profile methods

A number of modifications have been suggested to the creation of profiles that increase the sensitivity of the method. Several of these improvements have been incorporated into programs such as PROFILEWEIGHT ¹⁰ and the method of Luthy et al.¹¹

For example, an alignment often consists of many closely related sequences together with a few rather divergent ones. The closely related sequences in the multiple alignment (learning set) offer little additional information, yet bias the profile residue scores. Sequence weighting schemes which upweight divergent sequences while downweighting closely related groupings have thus been found to improve profile sensitivity.
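A simple weighting scheme of this kind is sketched below. It is not the scheme used by PROFILEWEIGHT; it assumes that each sequence is weighted by its mean fractional difference from all other sequences in the alignment, normalised so that the weights sum to 1. The function names are hypothetical.

```python
def fractional_difference(a, b):
    """Fraction of aligned (non-gap) positions at which two sequences differ."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def divergence_weights(alignment):
    """Upweight divergent sequences and downweight close relatives."""
    n = len(alignment)
    raw = []
    for i, seq in enumerate(alignment):
        others = [fractional_difference(seq, alignment[j])
                  for j in range(n) if j != i]
        raw.append(sum(others) / len(others) if others else 1.0)
    total = sum(raw)
    if total == 0:
        # All sequences identical: fall back to equal weights.
        return [1.0 / n] * n
    return [w / total for w in raw]


if __name__ == "__main__":
    print(divergence_weights(["ACDE", "ACDE", "WCYE"]))
```

The two identical sequences share the weight that a single copy would otherwise receive, so they no longer dominate the residue scores.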

Noise is also reduced in database searches by gap excision,³⁰ since long insertions are sites of breakdown in homology within the family and typically lack meaningful conservation. Release 2 of PROFILEWEIGHT will also bring in new gap penalty reductions based on average gap length, rather than the single longest sequence, to better match observed gap properties in alignments. Both TPROFILESEARCH (P. Rice, EBI, Hinxton) and the PairWise/SearchWise package (E. Birney, J. Thompson and T. Gibson) are able to perform protein profile alignments to 6-frame translations of DNA sequences. The latter programs use an extension to dynamic programming to compare the profile simultaneously to the three translation frames of a DNA strand, allowing frame-jumping.¹²
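For orientation, a 6-frame translation can be produced as in the sketch below. It assumes the standard genetic code and translates each frame independently; it does not attempt the simultaneous three-frame comparison with frame-jumping described above. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return the three forward (+1..+3) and three reverse (-1..-3) frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translations("ATGGCCAAGTAA").items():
        print(name, protein)
```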

Automated pattern generation based on the PIMA program

Temple Smith and Randy Smith at Harvard University have developed a method that can automatically generate a diagnostic sequence pattern from a collection of homologous protein sequences and have used this method to construct diagnostic patterns for all protein families in the SWISS-PROT protein sequence database. Using BLASTP, a new high-speed similarity search tool of Stephen Altschul, all sequences in the SWISS-PROT protein sequence database were compared pairwise. Similar sequences (i.e. those with BLASTP scores > 65.0) were then clustered using a maximal-linkage joining rule. For SWISS-PROT release 13, this generated 2026 clusters of 2 sequences or more, encompassing 10665 of the 13837 sequences in the database. "Consensus-like" AACC (amino acid class covering) patterns were constructed from each of the sets of clustered sequences using the following steps. Starting with the most similar pair of sequences in the cluster, an optimal local alignment is performed using an extended dynamic programming algorithm. A consensus-like sequence is then constructed by 1) associating aligned residues with the least inclusive amino acid class containing the aligned residues using a simplified amino acid classification hierarchy (e.g. D/D -> D; D/E -> [DE]; D/V -> X) and 2) filling gaps with "gap characters" (each representing 0 or 1 residue of any type), retaining the longest gap observed at any one site. To accommodate AACC sequences in the dynamic programming algorithm, the scoring matrix was extended to include single-residue vs. amino acid class as well as class vs. class comparisons. To allow the proper alignment of AACC sequences containing gap characters, 1) gap characters are given a match score of 0 when aligned with any character and 2) gap penalties are reduced for the introduction of a new gap within regions containing gap characters.
After construction of the AACC sequence, the two aligned input sequences are replaced by the single new AACC sequence at the node joining the input sequences in the cluster tree hierarchy. This process is then repeated for the next most similar pair of sequences (either actual or AACC sequences) and continued until the "root" AACC sequence for the entire tree has been generated. The current release of the pattern database (rel. 4.0) includes 2026 patterns derived from all sequence families containing 2 members or more (encompassing 10665 of the 13837 sequences in SWISS-PROT rel. 13) plus all of the remaining 3173 "non-related" sequences, represented as single-member "patterns". The AACC patterns consist of upper-case characters (standard IUPAC one-letter code) representing absolutely conserved residues, lower-case characters representing pre-defined amino acid classes (shown below), wild-card characters (X), and gap characters (g) representing 0 or 1 residues of any type.
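The consensus-building step for one aligned pair can be sketched as follows. The class list is a small illustrative assumption and not the hierarchy actually used by the method; classes are written in bracket notation as in the D/E -> [DE] example rather than as lower-case class letters. The function names are hypothetical, and the input is assumed to be two already aligned plain sequences with "-" marking gaps.

```python
# Illustrative amino acid classes (assumption, not the real hierarchy).
CLASSES = [set("DE"), set("KR"), set("ST"), set("IV"), set("LM"),
           set("FWY"), set("ILMV"), set("DEKR")]


def merge_column(x, y):
    """Merge two aligned symbols into one consensus-like symbol."""
    if x == "-" or y == "-":
        return "g"                       # gap character: 0 or 1 residue
    if x == y:
        return x                         # conserved residue
    pair = {x, y}
    for cls in sorted(CLASSES, key=len): # smallest class first
        if pair <= cls:
            return "[" + "".join(sorted(cls)) + "]"
    return "X"                           # no class covers both: wild card


def merge_aligned(a, b):
    """Build a consensus-like pattern from two aligned sequences."""
    return "".join(merge_column(x, y) for x, y in zip(a, b))


if __name__ == "__main__":
    print(merge_aligned("DDDA", "DEV-"))
```

Trying the classes from smallest to largest is what makes the chosen class the least inclusive one that contains both residues.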

The nomenclature of the groups is as follows:

                                                            -2  "x"

          ________________ X __________________              0  " "
         /          /           \              \
     __ f __       /       ______r _______      \            1  "."
    /  /    \     /       /   /     \     \      \
   /  c      \   e       /   m       p     \    _ j __       2  ":"
  /  / \      \ / \     /   / \     / \     \  /   \  \
 /  a    b     d   \   /   l   k   o    n     i     h  \     3 "!"
/  / \  / \   /|\   \ /   / \ / \ / \   /\   / \   / \  \
C  I V  L M  F W Y   H   N   D   E   Q  K R  S T   A G   P   5 "|"

Automated versions of this pattern building and pattern search procedure are now available on the BCM server:

http://dot.imgen.bcm.tmc.edu:9331/seq-search/protein-search.html

Automated iterative motif search

A recent motif search method (MoST, for motif search tool)¹³ follows the BLAST strategy (in which gaps are not treated) in order to be able to handle the resulting alignment blocks in a proper mathematical sense. These blocks are converted into a position-dependent weighting matrix following a log-odds, Bayesian-based approach and incorporating prior residue probabilities calculated from a mixture of Dirichlet distributions. MoST combines an extremely fast block search with good sensitivity. In addition, automatic iterations have been incorporated, i.e. database sequences scoring above a user-defined threshold are incorporated in the block alignment and the weighting for the next iteration is adapted to the new alignment. To allow for the different behaviours of protein families, manual intervention is possible at several levels. The drawback of excluding gaps will be circumvented in an improved version that handles the statistics of several blocks (E.V. Koonin, pers. commun.). Thus, this method is highly recommended, in particular, if functionally conserved residues are surrounded by semi-conserved but structurally important positions, as can be observed in distantly related enzymes.
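The conversion of an ungapped block into a log-odds weighting matrix can be sketched as follows. This is not the MoST implementation: a single background-proportional pseudocount is used as a simple stand-in for the Dirichlet mixture prior, the alphabet is reduced to four letters for brevity, and the iteration step is omitted. All names and values are illustrative assumptions.

```python
import math


def log_odds_matrix(block, background, pseudocount=1.0):
    """Position-dependent log-odds scores from an ungapped block.

    block:      equal-length sequences without gaps.
    background: dict mapping residue -> background frequency.
    """
    n = len(block)
    matrix = []
    for column in zip(*block):
        scores = {}
        for aa, bg in background.items():
            freq = (column.count(aa) + pseudocount * bg) / (n + pseudocount)
            scores[aa] = math.log2(freq / bg)
        matrix.append(scores)
    return matrix


def best_ungapped_hit(sequence, matrix):
    """Slide the matrix along a sequence; return (best_score, start)."""
    width = len(matrix)
    best = None
    for start in range(len(sequence) - width + 1):
        score = sum(matrix[i][sequence[start + i]] for i in range(width))
        if best is None or score > best[0]:
            best = (score, start)
    return best


if __name__ == "__main__":
    uniform = {aa: 0.25 for aa in "ACDE"}   # toy background (assumption)
    weights = log_odds_matrix(["AC", "AC", "AD"], uniform)
    print(best_ungapped_hit("EEACEE", weights))
```

In an iterative search, sequences whose best score exceeds a user-defined threshold would be added to the block and the matrix recomputed before the next pass.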

Other recent methods

In addition to further improvements of techniques related to the ones mentioned above, other recent approaches that might prove valuable in identifying distant homologies include the application of neural networks ¹⁴ and methods that try to tackle the automatic iteration of the search procedure.¹⁵