Distant similarities: an introduction
Pattern terminology: confusing names for simple things
Sequences are a kind of chemical structure or text
Similarity of two sequences: alignments
Biological significance, mathematical significance
Similarity groups: patterns, motifs, signatures
Patterns as representations of similarity groups
Profiles as representations of similarity groups
Database search is probably the most widely used approach to predict the biological function of a newly determined protein sequence. The likelihood of finding homologues (i.e. evolutionarily related proteins) to a sequence is currently higher than 80% for bacterial, 70% for yeast and perhaps 60% for animal sequence queries. The rest appear to be truly novel proteins, but an estimated half of them are in fact non-trivial cases that can be identified only by more sophisticated methods of analysis. We refer to these cases as "distant similarities".
If two large proteins share only a few common domains, or the level
of identity is around 25% or below, detection of similarity often becomes
difficult. First, random identities present in the alignment may "mask"
the biologically important sequence patterns. Second, search programs such
as FASTA, FASTDB or BLAST report only the best homology regions between the
query and a database entry, so weaker homologies between the query and
the same entry can remain undetected. In this "twilight zone"
of homologies (Russell Doolittle's expression), biological significance does
not coincide with mathematical significance, so simple similarity searches
do not give unequivocal results. In these cases, the experimenter is often
tempted to give up, saying "the sequence showed no significant homology
to any other sequence in the database". In this chapter we describe
a number of simple ways to analyze a newly determined protein
sequence that you may try in order to go beyond the scope of a simple database
search.
There are no general solutions for finding distant homologies. However,
there are a number of useful solutions that are likely to work with various
problems, so it is a good idea to try a number of different programs and
databases. ICGEBnet maintains a growing collection of these, and useful
WWW links are provided in the last chapter of this tutorial.
When discussing sequence homology we use terms such as "patterns", "motifs" and "similarity". These are broad and vague terms: it is enough to think of concepts like patterns of behaviour or musical motifs. When we use these terms for sequences, we use them in a given context and in a restricted sense. Typically, a PROSITE pattern (see below) is a special sequence pattern that makes sense only in the context of the PROSITE database and the software that can be used to search it. In other words, there is a proper mathematical definition of the patterns used in PROSITE and in the search software. If we keep this in mind, we will not be disturbed by the fact that some workers call PROSITE patterns "motifs": they refer to the same thing. This kind of problem is characteristic of every new field in science.
If we still want to make a distinction, the word "motif" is
used for simpler patterns of identity that are usually short and can be
described in simple terms. The word "pattern" itself is used more
for more detailed descriptions, typically for those comprising several
motifs. However, it is also correct to speak about a "hydrophobicity
pattern", i.e. the distribution of a quantitative property along the
sequence. Finally, the word "profile" is used mostly for complex,
quantitative descriptions derived from a multiple alignment.
Chemical structures are collections of elements (typically atoms) with connections (like chemical bonds) between them. Protein sequences are composed of a limited set of special elements, amino acid residues, and only one type of connection is represented, that of the protein backbone, which only shows which residues are next to each other (this connection can also be termed "sequential vicinity"). This representation is very similar to a text, especially if we use the one-letter code for amino acids. In fact, most mathematical methods used for sequences originate from text processing. Consider concepts such as "deletions" and "insertions", now widely used in genetics, which also come from text editing.
So we have two metaphors for sequences: chemical structures and
texts. When should we use each? Sequence analysis, such as database searching,
can be much better understood with the text metaphor. This holds for
all problems where we keep the amino acid residue representation as standard.
If we need more versatile descriptions, the chemical structure metaphor
is more useful. For example, the structure of multidomain proteins can
be described as a series of complex elements, such as signal peptides,
immunoglobulin domains etc. The chemical structure analogy is especially
conspicuous when dealing with the 3D structure of proteins composed of
a few α-helices and β-sheets in a special arrangement.
The concept of similarity is basically a very complex notion. For sequences we use a very restricted definition to which a mathematical concept can be assigned. The similarity of two sequences is best defined by the alignment of the two sequences, and can be characterized by the alignment pattern and the similarity score.
There are many ways to align sequences, and by "alignment"
we refer to the best alignment that has the maximum number of identical
residues.
By alignment pattern we refer to the pattern of identical residues as shown
in the following example:
                      1   5    10
sequence 1            CPKICIGGWFA
sequence 2            CSGICKKAWFG
"alignment pattern"   C..IC...WF
We chose here a simple representation for the alignment pattern, called
a consensus sequence. There are other ways to represent such motifs. Generally,
alignments also contain insertions or gaps (see below).
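The alignment pattern of the example can be derived mechanically. The following is only a minimal sketch in Python: the function names are chosen here for illustration, and it assumes that the two sequences are already aligned and contain no gaps. Note that it also prints a dot for the final, non-identical position.

```python
def alignment_pattern(seq1: str, seq2: str, mismatch: str = ".") -> str:
    """Return the pattern of identical residues of two pre-aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    # Keep the residue where both sequences agree, print a dot otherwise.
    return "".join(a if a == b else mismatch for a, b in zip(seq1, seq2))


def percent_identity(seq1: str, seq2: str) -> float:
    """Percentage of aligned positions that carry identical residues."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    identical = sum(a == b for a, b in zip(seq1, seq2))
    return 100.0 * identical / len(seq1)


if __name__ == "__main__":
    s1, s2 = "CPKICIGGWFA", "CSGICKKAWFG"
    print(alignment_pattern(s1, s2))
    print(round(percent_identity(s1, s2), 1), "% identity")
```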
The second characteristic of an alignment is the so-called similarity score. The standard alignment techniques, the same ones that are widely used in database searching, use an empirical similarity score which is defined so as to be high for good alignments. Simply put, a similarity score is a sum of values assigned to identities and replacements, minus a sum of penalty values assigned to gaps within an alignment. The values of identities and replacements are the elements of the so-called replacement matrix (the Dayhoff matrix is the traditional choice for evolutionary relations; the more recent BLOSUM matrices are more often used for "general purpose" database searches). The penalty values can take various forms (uniform, gap-length-dependent, in more sophisticated forms even sequence-dependent). In any case, the general formula of the similarity score s is as follows:
$$ s = \sum_{i} M(a_i, b_i) - \sum_{g} w(g) $$
where the first sum runs over the aligned residue pairs (a_i, b_i), M is the replacement matrix, and the second sum runs over the gaps g of the alignment, each with its penalty w(g).
In many alignment and database search programs, the similarity score
is defined in such a way that for identical sequences it gives a value
equal to the sequence length.
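As an illustration of the definition above, the score of a given alignment can be computed as follows. This is only a sketch under stated assumptions: instead of a Dayhoff or BLOSUM matrix it uses a toy scoring (1 for an identity, 0 for a replacement), the gap penalties (2.0 for opening, 0.5 for each further gap position) are arbitrary illustrative values, and gaps are written as "-". With this toy scoring, two identical sequences score exactly their length.

```python
def similarity_score(aligned1: str, aligned2: str,
                     match: float = 1.0, mismatch: float = 0.0,
                     gap_open: float = 2.0, gap_extend: float = 0.5) -> float:
    """Sum of residue-pair values minus gap penalties for a given alignment."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have the same length")
    score = 0.0
    gap_side = None  # which sequence carried a gap at the previous position
    for a, b in zip(aligned1, aligned2):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        side = 1 if a == "-" else 2 if b == "-" else None
        if side is None:
            # Aligned residue pair: add the (toy) replacement value.
            score += match if a == b else mismatch
        else:
            # Gap position: opening costs more than extending the same gap.
            score -= gap_extend if side == gap_side else gap_open
        gap_side = side
    return score


if __name__ == "__main__":
    print(similarity_score("CPKICIGGWFA", "CPKICIGGWFA"))
    print(similarity_score("CPKICIGGWFA", "CSGICKKAWFG"))
    print(similarity_score("AC--EF", "ACDQEF"))
```

A real search program would replace the two toy values by a lookup in the chosen replacement matrix.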
Can we use the alignment pattern and the similarity score to describe
similarity? Yes, but only in the context of the given sequences, or within
a given database.
The similarity score can be statistically compared with other similarity
scores (obtained between the same query sequence and other members of the
database, or with randomly shuffled sequences). The mathematical significance
calculated in this manner is usually based on the simplest statistics, such
as Student's t-test.
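The comparison with randomly shuffled sequences can be sketched as follows. This is an illustration only, not the procedure of any particular search program: the local alignment score uses a toy scoring (match +2, mismatch -1, gap -2, all chosen here), and the significance is expressed as a simple z-score (distance of the observed score from the mean of the shuffled scores, in standard deviations) rather than a full Student's t-test.

```python
import random
import statistics


def local_score(s1: str, s2: str, match: int = 2,
                mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score (Smith-Waterman recurrence, linear gaps)."""
    prev = [0] * (len(s2) + 1)
    best = 0
    for a in s1:
        curr = [0]
        for j, b in enumerate(s2, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best


def shuffle_z_score(query: str, target: str,
                    n_shuffles: int = 200, seed: int = 0) -> float:
    """z-score of the observed score against scores of shuffled queries."""
    rng = random.Random(seed)  # fixed seed, so the sketch is reproducible
    observed = local_score(query, target)
    residues = list(query)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # same composition, random order
        background.append(local_score("".join(residues), target))
    mean = statistics.mean(background)
    sd = statistics.stdev(background)
    return (observed - mean) / sd if sd > 0 else float("inf")


if __name__ == "__main__":
    print(shuffle_z_score("CPKICIGGWFA", "CSGICKKAWFG"))
```

Shuffling keeps the amino acid composition of the query, so the background scores reflect what composition alone can produce.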
The use of mathematical significance in sequence analysis is only auxiliary:
it is recommended only if there is no other, biological evidence for homology.
By biological significance of an alignment we simply mean all possible biological explanations that suggest that the similarity of the two sequences is not due to chance. The first group of arguments depends on the sequences compared, and we cannot deal with them in detail here. For example, if both sequences are extracellular, both are from multidomain proteins, both have the same intron types, etc., then it is more probable that the similarity we found is not due to pure chance.
The second group of arguments comes from the experiences of researchers developing patterns and motifs. These are widely used but qualitative and approximate rules that help one to evaluate alignment. E.g.:
One of the ways to deal with this problem is to leave out repetitive sequences from a database. This is a very subjective procedure, since one has to decide what will actually be considered repetitive. However, it is very useful and simple. John Wootton designed a program that can "blank out" repetitive sequences from a sequence database. The keyword used for this problem is "complexity": sequences that are repetitive are of "low complexity".
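The idea of "complexity" can be illustrated with a small sketch. This is not Wootton's program: it is a simplified stand-in that measures the Shannon entropy of the residue composition in a sliding window and replaces low-entropy windows by "X". The window length (12) and the entropy threshold (2.2 bits) are arbitrary values chosen here, which reflects the subjective element mentioned above.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Entropy (in bits) of the residue composition of a segment."""
    total = len(segment)
    counts = Counter(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq: str, window: int = 12,
                        threshold: float = 2.2, mask_char: str = "X") -> str:
    """Blank out every window whose composition entropy is below threshold."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            # Replace the whole window; overlapping windows may extend the mask.
            masked[start:start + window] = mask_char * window
    return "".join(masked)


if __name__ == "__main__":
    print(mask_low_complexity("QQQQQQQQQQQQ"))
    print(mask_low_complexity("ACDEFGHIKLMN"))
```

Because whole windows are masked, a few residues flanking a repetitive stretch can be blanked out as well; a smaller window or a lower threshold makes the masking less aggressive.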
There are many other interesting practical rules (after Bork and Gibson).
The take-home message is simple. Mathematical significance and biological
significance are not the same thing. If two (long) sequences have less
than 25% identity, one has to use biological knowledge in order to evaluate
the alignment. Similarity of two sequences covers two useful concepts related
to alignments: i) a similarity score, which has to be statistically evaluated
for significance; and ii) an alignment pattern, which can be evaluated by
empirical rules and by its statistical occurrence.
A group of sequences sharing a particular alignment motif is called a "similarity group". The regions of similarity usually include only a part of a sequence. By similarity group we mean a collection of these similarity regions, as schematically shown in the figure below. Collections of similarity groups can be made manually, by human work, as with the SBASE protein domain sequence library, or automatically, as in the case of PRODOM.
Sequences of a similarity group can be subjected to multiple alignment in order to find all conserved residues, i.e. the complete pattern. These can be represented in the form of regular expressions, or in dedicated mathematical forms.
The (complete) pattern can then be used to scan the database and
to locate all proteins that contain it. For this purpose, we need a quantitative
measure. There are various measures, e.g. yes-or-no type measures are used
with simple pattern descriptions (see below). More refined methods use
a similarity score, which is actually identical to those described for
alignments. Such a similarity score can be subjected to statistics, so
one can quantitatively describe whether a similarity group is different from
the rest of the database. Here we can use Student's t-test for quantifying
the difference.
This is very useful when one builds a pattern and wants to test whether an improved
pattern is better than the previous version. For simple applications it
is sufficient to say that there is a preset threshold value, and if comparison
of a pattern with a sequence produces a similarity score above the threshold,
the sequence contains the pattern.
A pattern is "diagnostic" if it can be used to locate all sequences from which it was derived, and only those. In less fortunate cases there will be a number of sequences that contain the pattern even though they do not (or, for biological reasons, cannot) belong to the original group. These are called false positives. Conversely, true members of the group that are missed by the pattern are called false negatives. The distribution of the scores gives a very clear meaning to these terms, and also explains the meaning of the statistical significance of a pattern. (As in the previous example, the t value is a measure of the separation of the "positive" and "negative" groups.)
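These terms can be made concrete with a short sketch. The scores below are invented illustrative numbers, not results of a real search, and the function names are chosen here. The t value is computed in the unequal-variance (Welch) form, which is an assumption; the text does not say which variant of the t-test is meant.

```python
import math
import statistics


def evaluate_threshold(member_scores, other_scores, threshold):
    """Count true/false positives and negatives for a preset threshold."""
    true_pos = sum(s > threshold for s in member_scores)
    false_pos = sum(s > threshold for s in other_scores)
    return {
        "true_positives": true_pos,
        "false_negatives": len(member_scores) - true_pos,
        "false_positives": false_pos,
        "true_negatives": len(other_scores) - false_pos,
    }


def t_value(group_a, group_b):
    """Separation of two score distributions (Welch's t statistic)."""
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error


if __name__ == "__main__":
    members = [52, 48, 61, 30]      # hypothetical scores of known group members
    others = [12, 18, 25, 41, 9]    # hypothetical scores of unrelated sequences
    print(evaluate_threshold(members, others, threshold=35))
    print(t_value(members, others))
```

A pattern is diagnostic at a given threshold when both the false positive and the false negative counts are zero.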
Many similarity groups are very well conserved, so one does not need
the full-length pattern in order to identify their members without false
positives and false negatives. In these cases it is enough to use a small
conserved part of the pattern, and such short diagnostic patterns are called
"signatures" (Amos Bairoch's expression). In most cases
one can use a simple yes-or-no test to see if the signature is found within
a sequence. The use of signatures carries the danger, however, that new members
of the similarity group may not share them (even though the homology is
complete in other parts of the sequence), so they can be missed. Patterns,
motifs or signatures are names designating consensus representations of
a similarity group. The property which is most important for us here is
that these consensus representations allow one to retrieve possibly all members
of a similarity group. There are simple statistical measures that allow
one to calculate the accuracy of a pattern in database searching.
The basic step of building a pattern is multiple alignment. Standard
programs, like CLUSTAL or the PILEUP program of the GCG package, are very
useful starting points to identify the "core" of an alignment.
One can usually improve the patterns by intuition, adding residues on both
sides of the core. More refined programs (like the PROFILE program of
Gribskov, and its more recent versions) can be used in the same way. An
elementary example of pattern building is given in here. Once familiar with the basics and
having a good knowledge on the group of proteins, one can start building
patterns. Some basic rules are summarized in
- they describe how to select some of the usual
parameters of
database and profile search, like search matrices, gap penalties, etc.
Summarizing, we can define a similarity group as a group of sequences
that carry a similar sequence pattern and that potentially have a similar
biological function. The task of the researcher is to establish whether this
group is biologically significant, and if this is the case, a new pattern
is born.
Patterns or motifs are generalized representations of a similarity group. They incorporate "all common features" of the sequence group. Naturally, "all common features" can mean an endless variety of things, which is why there are in fact many forms for representing patterns.
The simplest form is the regular expression,
a well known mathematical term. For example, the pattern
[RK]-x(2)-[DE]-x(3)-Y
is such a description of a tyrosine phosphorylation site. This site contains an arginine or lysine residue, followed by two residues of any type, then an aspartic or glutamic acid, followed by three residues of any type and a tyrosine residue. There are very fast search algorithms for locating regular expressions in long text files such as sequences. Regular expressions also allow flexible statements such as "P(2-5)", meaning two to five prolines. Their only drawback is that they fail at any single mismatch. If such mismatched cases are still important from the biological point of view, then regular expressions are not suitable for describing the homology group. The PROSITE collection contains many hundreds of signatures for a wide variety of homology groups (ligand binding sites, active sites, etc.; see the PROSITE chapter of the Databanks section of this tutorial). There are fast programs that can be used to search PROSITE, such as MOTIFS in the GCG package.
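A pattern of this kind maps directly onto the regular expressions found in most programming languages. The sketch below uses Python's standard re module; the converter is a hypothetical helper written for this illustration (it is not the MOTIFS program), and it covers only the elements used in this example: single residues, bracketed alternatives, "x" for any residue, and repeat counts written as (n) or (n,m).

```python
import re

# One pattern element: a residue, a bracketed set or "x", plus optional count.
ELEMENT = re.compile(r"(\[[A-Z]+\]|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a regular expression."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError("unsupported pattern element: " + element)
        residue, low, high = m.groups()
        regex = "." if residue == "x" else residue
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)


def find_sites(pattern: str, sequence: str):
    """Return (0-based start, matched segment) for every non-overlapping hit."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    site = "[RK]-x(2)-[DE]-x(3)-Y"
    print(prosite_to_regex(site))
    print(find_sites(site, "AAKGGDLLLYAA"))
    print(find_sites(site, "AAKGGDLLLFAA"))  # one mismatch, so no hit
```

The last call shows the drawback mentioned above: a single replacement at a fixed position (Y to F) is enough for the pattern to be missed.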
Consensus sequences are a simpler form of regular expressions that are
often used in molecular biology. The simplest consensus sequences do not
allow flexible statements (such as "threonine or serine" or "two
to five residues of any kind"). For example, the tyrosine phosphorylation
site "[RK]-x(2)-[DE]-x(3)-Y" can be translated into the following
consensus sequences:
RXXDXXXY RXXEXXXY KXXDXXXY KXXEXXXY
The difference is obvious: we had to write several consensus sequences
instead of one regular expression. The advantage of consensus sequences,
however, is that they can be used simply as queries with fast database search
programs like FASTA and FASTDB, so the experimenter can very easily verify
whether a pattern is shared by a biologically meaningful group of sequences.
In this case the answer is a score value, not a simple yes or no as
with regular expressions. Consensus sequences are very useful in searching
for long and loosely defined patterns. On the other hand, short patterns,
like the ones in the previous example, can be better searched with regular
expressions.
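The translation of one regular expression into several consensus sequences is a simple enumeration of all combinations of the alternative residues. A minimal sketch, assuming the pattern is given position by position as a list of allowed residues, with "X" standing for any residue (the function name and this input format are chosen here for illustration):

```python
from itertools import product


def expand_to_consensus(positions):
    """List every consensus sequence covered by the per-position choices."""
    # product() picks one residue per position, in all possible combinations.
    return ["".join(choice) for choice in product(*positions)]


if __name__ == "__main__":
    # [RK]-x(2)-[DE]-x(3)-Y written position by position
    site = ["RK", "X", "X", "DE", "X", "X", "X", "Y"]
    for consensus in expand_to_consensus(site):
        print(consensus)
```

The number of consensus sequences is the product of the numbers of alternatives at each position, so patterns with many variable positions quickly become impractical to expand in this way.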
Temple Smith and Randy Smith use an extended terminology for consensus
sequences that allows the use of variable residues and gaps. Their
PIMA program automatically generates "complete" consensus patterns
for predefined groups of sequences. They constructed a comprehensive consensus
sequence library called PLSEARCH from all the known protein superfamilies,
which can be used to locate regions of similarity in a newly determined
sequence (see below).
Profiles are specific representations that incorporate the entire information
of a multiple alignment. The simplest forms of profiles can be pictured
as an averaged representation of all sequences in a multiple alignment.
This averaged "phantom sequence" representation can then be used
as a query in a database search. The trick is simple. We have to remember
that in alignments (or database searching) each amino acid type has a column
in the replacement matrix (e.g. the Dayhoff matrix) which contains the
20 values for replacing it with any of the 20 amino acids. These
values are independent of the position of an amino acid within the sequence.
Using a multiple alignment, one can make a position-dependent replacement
matrix, and this will be the "profile". This can be demonstrated
on the simple example already shown above:
             1   5    10
sequence 1   CPKICIGGWFA
sequence 2   CSGICKKAWFG
For position 1 we will use the C column of the replacement matrix. For
position 2, we will use an average of the P and S columns. For position
3, we will use the average of the K and G columns, etc. Positions 9 and
10 are conserved, so there we will use the original W and F columns of
the replacement matrix, respectively. By doing this transformation, we
will end up with a new matrix, which will have 20 rows (the original row
number of the matrix) but 11 columns, i.e. as many columns as the
length of the alignment. This is a simple profile that can be used in database
search programs with little modification. Naturally, one can use a
variety of other, much more sophisticated methods to build profiles.
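The column-averaging procedure can be sketched in a few lines. The sketch makes two simplifying assumptions that are not part of the method itself: a toy replacement matrix (1 for an identity, 0 for any replacement) stands in for the Dayhoff matrix, and the alignment contains no gaps. The profile is stored as one column (a dictionary of 20 values) per alignment position.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_matrix_column(residue, match=1.0, mismatch=0.0):
    """Column of a toy replacement matrix for one residue type."""
    return {aa: (match if aa == residue else mismatch) for aa in AMINO_ACIDS}


def build_profile(aligned_sequences):
    """Average the replacement-matrix columns of the residues at each position."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("aligned sequences must have the same length")
    profile = []
    for i in range(length):
        columns = [toy_matrix_column(seq[i]) for seq in aligned_sequences]
        profile.append({aa: sum(col[aa] for col in columns) / len(columns)
                        for aa in AMINO_ACIDS})
    return profile


def score_segment(profile, segment):
    """Score an ungapped segment of the same length against the profile."""
    if len(segment) != len(profile):
        raise ValueError("segment and profile must have the same length")
    return sum(column[residue] for column, residue in zip(profile, segment))


if __name__ == "__main__":
    profile = build_profile(["CPKICIGGWFA", "CSGICKKAWFG"])
    print(len(AMINO_ACIDS), "rows x", len(profile), "columns")
    print(score_segment(profile, "CPKICIGGWFA"))
```

With a real replacement matrix only toy_matrix_column would change; the averaging and the scoring stay the same.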
Patterns and profiles are average representations, and all averages are biased. In our case this means that the pattern will faithfully reflect the properties of the similarity group we are working with, and it can be severely biased if the group is not properly selected. For example, fibronectin type 3 repeats (FN3 repeats) have two major types: the one occurring in fibronectins (FN3A) and the one occurring in cell adhesion molecules (FN3B). FN3A repeats were known much earlier and in much greater numbers; FN3B repeats were discovered later. At a given point in time, the database contained 120 FN3A and 10 FN3B repeats. If one built a separate pattern for the FN3A group, it recognized only a few members of the FN3B group (and vice versa, the FN3B pattern recognized only some of the FN3As). If one combined all 130 sequences in one group for building a pattern, the resulting pattern mainly reflected the properties of FN3A, i.e. the majority group: it recognized the FN3A group much better and hardly recognized the FN3B group. There is no theoretical solution to this problem. One practical solution is to take an equal number of representatives from each group. In this case one could keep all FN3Bs and add 10 "well selected" members of the FN3A family. In fact, such a strategy quite frequently allows one to find new domain examples in the database.
There are a number of problems with selecting a good set, however. i) The collection contains repeats from the same protein, and these are sometimes more closely related to each other than to repeats of other families. So it might be better to select one repeat from every protein. ii) Sometimes the serial order of the repeats is important. For example, repeat 1 is highly homologous among the proteins, repeat 2 is not, etc. So it might make sense to select one of each group.
A strategy to avoid the bias of patterns is the use of a domain library, such as SBASE or PRODOM. Instead of consensus representations, SBASE contains several copies of known functional domains, so spurious homologies to individual members of a homology group can be picked up by a simple database search. Searching this domain library with a program like BLAST or FASTA can give you direct information on functional homologies to segments of your protein sequence (see below). With this approach, the similarity group itself is used as a diagnostic tool.
There are two ways to build domain databases. SBASE is a collection
of domains annotated in sequence databases. The domain boundaries are man-made
and correspond to some established function or structure. (In fact
SBASE contains functional, structural, cellular topology-directed domains
as well as simple repeats found in various proteins). The advantage of
this approach is that it gives a clear indication of the information known
about a certain domain type. PRODOM is a more recent database based on
the Swiss-Prot database. It is a comprehensive collection of regions similar
to each other, prepared by aligning each entry in Swiss-Prot with each
other entry. This is a machine-based approach that usually gives similarity
regions that do not coincide with the known boundaries of domains, modules
etc. Since it is machine-made, it can also contain quite novel similarity
groups.