II. BASIC CONCEPTS AND TERMINOLOGY

Pattern terminology: confusing names to simple things

Sequences are a kind of chemical structures or texts

Biological significance, mathematical significance

Patterns as representations of similarity groups

Profiles as representations of similarity groups

The bias of patterns and profiles

Collections of similarity groups: domain databases

Distant similarities: an introduction

Database search is probably the most widely used approach to predict the biological function of a newly determined protein sequence. The likelihood of finding homologues (i.e. evolutionarily related proteins) to a sequence is currently higher than 80% for bacterial, 70% for yeast and perhaps 60% for animal sequence queries. The rest appear to be truly novel proteins, but an estimated half of them are non-trivial cases that can be identified only by more sophisticated methods of analysis. We refer to these cases as "distant similarities".

If two large proteins share only a few common domains, or the level of identity is around 25% or below, detection of similarity often becomes difficult. First, random identities present in the alignment may "mask" the biologically important sequence patterns. Second, search programs such as FASTA, FASTDB or BLAST give only the best homology regions between the query and a database entry, so weaker homologies between the query and the same entry can remain undetected. In this "twilight zone" of homologies (Russell Doolittle's expression), biological significance does not coincide with mathematical significance, so simple similarity searches do not give unequivocal results. In these cases, the experimenter is often tempted to give up, saying "the sequence showed no significant homology to any other sequence in the database". In this chapter we describe a number of simple ways to analyze a newly determined protein sequence that you may try in order to go beyond the scope of simple database search.

There are no general solutions for finding distant homologies. However, there are a number of useful solutions that are likely to work with various problems, so it is a good idea to try a number of different programs and databases. ICGEBnet maintains a growing collection of these and useful WWW-links are provided in the last chapter of this tutorial.

Pattern terminology: confusing names to simple things

When discussing sequence homology we use terms such as "patterns", "motifs" and "similarity". These are broad and vague terms; it is enough to think of concepts like patterns of behaviour or musical motifs. When we use these terms for sequences, we use them in a given context and in a restricted sense. Typically, a PROSITE pattern (see below) is a special sequence pattern that makes sense only in the context of the PROSITE database and the software that can be used to search it. In other words, there is a proper mathematical definition of the patterns used in PROSITE and in the search software. If we keep this in mind we will not be disturbed by the fact that some workers call PROSITE patterns "motifs" - they refer to the same thing. This kind of problem is characteristic of every new field in science.

If we still want to make a distinction, the word "motif" is used for simpler patterns of identity that are usually short and can be described in simple terms. The word "pattern" itself is used more for detailed descriptions, typically those comprising several motifs. However, it is also correct to speak about a "hydrophobicity pattern", i.e. the distribution of a quantitative property along the sequence. Finally, the word "profile" is used mostly for complex, quantitative descriptions derived from multiple alignment.

Sequences are a kind of chemical structures or texts

Chemical structures are collections of elements (typically atoms) with connections (like chemical bonds) between them. Protein sequences are composed of a limited set of special elements, amino acid residues, and only one type of connection is represented, that of the protein backbone, which only shows which residues are next to each other (this connection can also be termed "sequential vicinity"). This representation is very similar to texts, especially if we use the one-letter code for amino acids. In fact, most mathematical methods used for sequences originate from text processing. It is enough to think of concepts such as "deletions" and "insertions", now widely used in genetics, which also come from text editing.

So we have two metaphors for sequences: chemical structures and texts. When should we use each? Sequence analysis, such as database searching, can be much better understood with the text metaphor. This is valid for all problems where we keep the amino acid residue representation as standard. If we need more versatile descriptions, the chemical structure metaphor is more useful. For example, the structure of multidomain proteins can be described as a series of complex elements, such as signal peptides, immunoglobulin domains etc. The chemical structure analogy is especially conspicuous when dealing with the 3D structure of proteins composed of a few α-helices and β-sheets in a special arrangement.

Similarity of two sequences: alignments

The concept of similarity is basically a very complex notion. For sequences we use a very restricted definition to which a mathematical concept can be assigned. The similarity of two sequences is best defined by the alignment of the two sequences, and can be characterized by the alignment pattern and the similarity score.

There are many ways to align sequences, and by "alignment" we refer to the best alignment that has the maximum number of identical residues. By alignment pattern we refer to the pattern of identical residues as shown in the following example:

                       1   5    10
sequence 1             CPKICIGGWFA
sequence 2             CSGICKKAWFG
"alignment pattern"    C..IC...WF.

We chose here a simple representation for the alignment pattern, called a consensus sequence. There are other ways to represent such motifs. Generally, alignments also contain insertions or gaps (see below).
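As a minimal sketch of how such an alignment pattern can be derived, the following Python function compares two sequences that are assumed to be already aligned and of equal length. The function name `alignment_pattern` and the use of "." for non-identical positions are choices made here for illustration only.

```python
def alignment_pattern(seq_a: str, seq_b: str, mismatch: str = ".") -> str:
    """Return the pattern of identical residues for two pre-aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep the residue where both sequences agree, otherwise write a placeholder.
    return "".join(a if a == b else mismatch for a, b in zip(seq_a, seq_b))


pattern = alignment_pattern("CPKICIGGWFA", "CSGICKKAWFG")
print(pattern)  # C..IC...WF.
```

Every position receives a character, so the pattern has the same length as the alignment (11 here, with the last position being a mismatch).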

The second characteristic of an alignment is the so-called similarity score. The standard techniques of alignment - the same as are widely used in database searching - use an empirical similarity score which is defined so as to be high for good alignments. Simply put, a similarity score is a sum of cost values assigned to identities and replacements, minus a sum of penalty values assigned to gaps within an alignment. The values of identities and replacements are the elements of the so-called replacement matrix (the Dayhoff matrix is the traditional choice for evolutionary relations, the more recent BLOSUM matrix is used more for "general purpose" database searches). The penalty values can take various forms (uniform, gap-length-dependent, in more sophisticated forms even sequence-dependent). In any case, the general formula of the similarity score s is as follows:
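Written out in symbols, with notation that is assumed here for illustration and follows only the verbal definition above:

$$
s \;=\; \sum_{i \,\in\, \text{aligned pairs}} w(a_i, b_i) \;-\; \sum_{k \,\in\, \text{gaps}} g(l_k)
$$

Here $w(a_i, b_i)$ is the replacement-matrix value for the residues $a_i$ and $b_i$ aligned at position $i$, and $g(l_k)$ is the penalty assigned to the $k$-th gap of length $l_k$. For a uniform penalty, $g$ is a constant; for a gap-length-dependent penalty, one common choice would be $g(l) = g_{\text{open}} + (l - 1)\, g_{\text{ext}}$.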

In many alignment and database search programs, the similarity score is defined in such a way that for identical sequences it gives a value equal to the sequence length.
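The following sketch scores an alignment that is already given, using the "sum of weights minus gap penalties" idea described above. The identity weight (1 for a match, 0 otherwise) stands in for a real replacement matrix such as Dayhoff or BLOSUM, and the gap penalty values are arbitrary assumptions. With this toy weight, two identical sequences score exactly their length.

```python
def identity_weight(a: str, b: str) -> float:
    """Toy replacement 'matrix': 1 for an identity, 0 for any replacement."""
    return 1.0 if a == b else 0.0


def similarity_score(aligned_a, aligned_b, weight=identity_weight,
                     gap_open=1.0, gap_extend=0.5, gap_char="-"):
    """Score a given alignment: sum of weights minus gap penalties.

    A run of L consecutive gap columns costs gap_open + (L - 1) * gap_extend.
    Simplification: adjacent gaps in different sequences count as one run.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0.0
    in_gap = False
    for a, b in zip(aligned_a, aligned_b):
        if a == gap_char or b == gap_char:
            score -= gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += weight(a, b)
            in_gap = False
    return score


print(similarity_score("CPKICIGGWFA", "CPKICIGGWFA"))  # 11.0
print(similarity_score("CPKICIGGWFA", "CSGICKKAWFG"))  # 5.0
```

This only evaluates an alignment; finding the best alignment requires a dynamic programming search, which is not shown here.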

Biological significance, mathematical significance

Can we use the alignment pattern and the similarity score to describe similarity? Yes, but only in the context of the given sequences, or within a given database. The similarity score can be statistically compared with other similarity scores (obtained between the same query sequence and other members of the database, or with randomly shuffled sequences). The mathematical significance calculated in this manner is usually based on the simplest statistics, such as Student's t-test. The use of mathematical significance in sequence analysis is only auxiliary; it is recommended only if there is no other, biological evidence for homology.
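The comparison with shuffled sequences can be sketched as follows. This example is an illustration under simplifying assumptions: it uses the number of identities in an ungapped comparison as the score, and it reports a plain z-score (distance from the mean of the shuffled scores in units of their standard deviation) rather than a Student's t-test. The function names and the number of shuffles are chosen here and are not taken from any particular program.

```python
import random
import statistics


def identity_score(seq_a, seq_b):
    """Number of identical residues in an ungapped, position-by-position comparison."""
    return sum(a == b for a, b in zip(seq_a, seq_b))


def shuffle_z_score(query, target, n_shuffles=1000, seed=0):
    """Compare the observed score with scores against shuffled versions of target."""
    rng = random.Random(seed)          # fixed seed so the result is reproducible
    observed = identity_score(query, target)
    residues = list(target)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)          # same composition, random order
        shuffled_scores.append(identity_score(query, residues))
    mean = statistics.mean(shuffled_scores)
    spread = statistics.stdev(shuffled_scores)
    return (observed - mean) / spread


z = shuffle_z_score("CPKICIGGWFA", "CSGICKKAWFG")
print(f"z = {z:.2f}")
```

Shuffling keeps the amino acid composition of the target, so the z-score measures how much the order of the residues, and not merely their frequency, contributes to the score.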

By biological significance of an alignment we simply mean all possible biological explanations that suggest that the similarity of the two sequences is not by chance. The first group of arguments depends on the sequences compared and we cannot deal with them in detail here. For example, if both sequences are extracellular, both are from multidomain proteins, both have the same intron types, etc., then it is more probable that the similarity we found is not by pure chance.

The second group of arguments comes from the experience of researchers developing patterns and motifs. These are widely used but qualitative and approximate rules that help one to evaluate an alignment. For example:

Frequent and infrequent residues. Some amino acids are very frequent in sequences. E.g. there are entries very rich in glycine, or very rich in hydrophobic residues (I, L, V, F). All of these will be seemingly similar to each other. So, even if two sequences align well, one has intuitively less trust in a pattern of only hydrophobic residues, since these can occur by chance. On the other hand, a pattern containing a number of tryptophans (i.e. a rare amino acid) may be taken more seriously.
Repetitive sequences. Some sequences have long runs of identical amino acids (or dipeptides, tripeptides etc.). These will show erroneous similarity to many sequences rich in the same amino acids. For example, collagen has approximate tripeptide repeats with P and G at certain positions. Collagens will thus produce a seemingly high alignment score when aligned with proline-rich or glycine-rich proteins.

One of the ways to deal with this problem is to leave repetitive sequences out of a database. This is a subjective procedure, since one has to decide what will actually be considered repetitive. However, it is very useful and simple. John Wootton designed a program that can "blank out" repetitive sequences from a sequence database. The keyword used for this problem is "complexity": sequences that are repetitive are of "low complexity".

Structurally important amino acids. Cysteines are the best examples as they can form disulfide bridges that stabilize a 3D fold. Prolines can also be important, because they can break alpha helices and can form parts of turns in the backbone. Both of these have to be taken with caution, however: there are cysteine-rich (or proline-rich) proteins that will produce erroneously high similarity scores with any protein that contains an otherwise interesting cysteine (or proline) pattern. Nevertheless, a particular distribution of cysteines is often very characteristic of a protein or protein module; sometimes this is the only thing that characterizes such a group.
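Returning to the repetitive, "low complexity" sequences discussed above: the idea of blanking them out can be sketched with a sliding window and Shannon entropy. This is a simple stand-in chosen here for illustration and is not the algorithm of Wootton's program; the window length, the entropy threshold and the mask character "X" are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (in bits) of the residue composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(sequence, window=12, threshold=2.2, mask_char="X"):
    """Replace every window whose entropy is below the threshold with mask_char."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


print(mask_low_complexity("GPPGPPGPPGPP"))  # collagen-like repeat: fully masked
print(mask_low_complexity("ACDEFGHIKLMN"))  # 12 different residues: unchanged
```

The subjectivity mentioned in the text shows up directly in the parameters: a different window or threshold changes what counts as repetitive.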

There are many other interesting practical rules (after Bork and Gibson). The take-home message is simple: mathematical significance and biological significance are not the same thing. If two (long) sequences have less than 25% identity, one has to use biological knowledge in order to evaluate the alignment. Similarity of two sequences covers two useful concepts related to alignments: i) a similarity score, which has to be statistically evaluated for significance; and ii) an alignment pattern, which can be evaluated by empirical rules and by statistical occurrence.

Similarity groups: patterns, motifs, signatures

A group of sequences sharing a particular alignment motif is called a "similarity group". The regions of similarity usually include only a part of a sequence. By similarity group we mean a collection of these similarity regions, as schematically shown in the figure below. Collections of similarity groups can be made manually, as with the SBASE protein domain sequence library, or automatically, as in the case of PRODOM.

Sequences of a similarity group can be subjected to multiple alignment in order to find all conserved residues, i.e. the complete pattern. These can be represented in the form of regular expressions, or in dedicated mathematical forms.

The (complete) pattern can then be used to scan the database and to locate all proteins that contain it. For this purpose, we need a quantitative measure. There are various measures, e.g. yes-or-no type measures are used with simple pattern descriptions (see below). More refined methods use a similarity score, which is essentially the same as that described for alignments. Such a similarity score can be subjected to statistics, so one can quantitatively describe whether a similarity group is different from the rest of the database. Here we can use Student's t-test to quantify the difference. This is very useful when one builds a pattern and wants to test whether an improved pattern is better than the previous version. For simple applications it is sufficient to say that there is a preset threshold value, and if comparison of a pattern with a sequence produces a similarity score above the threshold, the sequence contains the pattern.

A pattern is "diagnostic" if it can be used to locate all sequences from which it was derived. In less fortunate cases there will be a number of sequences that contain the pattern even though they do not (or, for biological reasons, cannot) belong to the original group. These are called false positives. Conversely, true members of the group that are missed by the pattern are called false negatives. The distribution of the scores gives a very clear meaning for these terms, and also explains the meaning of statistical significance of a pattern. (As in the previous example, the t value is a measure of the separation of the "positive" and "negative" groups.)
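The terms above can be made concrete with a small sketch. The scores and group memberships below are invented purely for illustration, and the t value is computed here as Welch's form of the t statistic, which is one reasonable choice rather than necessarily the one used by any particular program.

```python
import math
import statistics


def evaluate_threshold(scores, is_member, threshold):
    """Count true/false positives and negatives for a given score threshold."""
    counts = {"true_pos": 0, "false_pos": 0, "false_neg": 0, "true_neg": 0}
    for score, member in zip(scores, is_member):
        hit = score > threshold        # "above the threshold" means the pattern is found
        if hit and member:
            counts["true_pos"] += 1
        elif hit and not member:
            counts["false_pos"] += 1
        elif member:
            counts["false_neg"] += 1
        else:
            counts["true_neg"] += 1
    return counts


def separation_t(scores, is_member):
    """Welch's t statistic between scores of group members and non-members."""
    pos = [s for s, m in zip(scores, is_member) if m]
    neg = [s for s, m in zip(scores, is_member) if not m]
    standard_error = math.sqrt(statistics.variance(pos) / len(pos)
                               + statistics.variance(neg) / len(neg))
    return (statistics.mean(pos) - statistics.mean(neg)) / standard_error


# Hypothetical pattern-versus-sequence scores; the first four are true group members.
scores = [52, 48, 45, 30, 33, 20, 18, 15, 12, 10]
is_member = [True, True, True, True, False, False, False, False, False, False]

print(evaluate_threshold(scores, is_member, threshold=32))
print(f"t = {separation_t(scores, is_member):.2f}")
```

With these invented numbers, the member scoring 30 is a false negative and the non-member scoring 33 is a false positive; a better pattern would push the two score distributions further apart and raise the t value.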

Many similarity groups are very well conserved, so one does not need the full-length pattern in order to identify their members without false positives and false negatives. In these cases it is enough to use a small conserved part of the pattern, and such short diagnostic patterns are called "signatures" (Amos Bairoch's expression). In most cases one can use a simple yes-or-no test to see if the signature is found within a sequence. The use of signatures carries the danger, however, that new members of the similarity group may not share them (even though the homology is complete in other parts of the sequence), so they can be missed. Patterns, motifs or signatures are names designating consensus representations of a similarity group. The property which is most important for us here is that these consensus representations allow one to retrieve, ideally, all members of a similarity group. There are simple statistical measures that allow one to calculate the accuracy of a pattern in database searching.

The basic step of building a pattern is multiple alignment. Standard programs, like CLUSTAL or the PILEUP program of the GCG package, are very useful starting points to identify the "core" of an alignment. One can usually improve the patterns by intuition, adding residues on both sides of the core. More refined programs (like the PROFILE program of Gribskov, and its more recent versions) can be used in the same way. An elementary example of pattern building is given in here. Once familiar with the basics and having a good knowledge of the group of proteins, one can start building patterns. Some basic rules are summarized in - they describe how to select some of the usual parameters of database and profile search, like search matrices, gap penalties, etc.

Summarizing, we can define a similarity group as a group of sequences that carry a similar sequence pattern and that potentially have a similar biological function. The task of the researcher is to establish whether this group is biologically significant, and if this is the case, a new pattern is born.

Patterns as representations of similarity groups

Patterns or motifs are generalized representations of a similarity group. They incorporate "all common features" of the sequence group. Naturally, "all common features" can mean an endless variety of things, which is why there are in fact many forms for representing patterns.

The simplest form is the regular expression, a well known mathematical term. For example, the pattern

[RK]-x(2)-[DE]-x(3)-Y

is such a description of a tyrosine phosphorylation site. This site contains an arginine or lysine residue, followed by two residues of any type, then an aspartic or glutamic acid, followed by 3 residues of any type and a tyrosine residue. There are very fast search algorithms for locating regular expressions in long text files such as sequences. Regular expressions also allow flexible statements such as "P(2-5)", meaning 2 to 5 prolines. Their only drawback is that they fail at any single mismatch. If such cases are still important from the biological point of view, then regular expressions are not suitable for describing the homology group. The PROSITE collection contains many hundreds of signatures for a wide variety of homology groups (ligand binding sites, active sites, etc. - see the PROSITE chapter of the Databanks section of this tutorial). There are fast programs that can be used to search PROSITE, such as MOTIFS in the GCG package.
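As an illustration, a pattern written in this notation can be translated into a regular expression for Python's standard `re` module. The converter below is a minimal sketch that handles only the elements used in this section: single residues, bracketed alternatives, "x" for any residue, and repeat counts in parentheses (ranges are assumed to be written in the comma form, e.g. "P(2,5)"). The function name `prosite_to_regex` and the test sequence are hypothetical.

```python
import re

# One pattern element: a residue, a bracketed set, or x, with an optional repeat count.
ELEMENT = re.compile(r"(\[[A-Z]+\]|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":
            residue = "."                      # any residue
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"           # exact count, e.g. x(2)
        else:
            repeat = "{" + low + "," + high + "}"  # range, e.g. P(2,5)
        parts.append(residue + repeat)
    return "".join(parts)


regex = prosite_to_regex("[RK]-x(2)-[DE]-x(3)-Y")
print(regex)  # [RK].{2}[DE].{3}Y

hit = re.search(regex, "MAKRAADGHTYLL")   # hypothetical sequence
print(hit.start(), hit.group())  # 3 RAADGHTY
```

The search is a strict yes-or-no test: changing the D in the test sequence to any residue other than D or E makes the match disappear, which is the single-mismatch weakness described above.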

Consensus sequences are a simpler form of regular expressions that are often used in molecular biology. The simplest consensus sequences do not allow flexible statements (such as "threonine or serine" or "2 to 5 residues of any kind"). For example, the tyrosine phosphorylation site "[RK]-x(2)-[DE]-x(3)-Y" can be translated into the following consensus sequences:

RXXDXXXY
RXXEXXXY
KXXDXXXY
KXXEXXXY

The difference is obvious: we had to write several consensus sequences instead of one regular expression. The advantage of consensus sequences, however, is that they can be used simply as queries with fast database search programs like FASTA and FASTDB, so the experimenter can very easily verify whether a pattern is shared by a biologically meaningful group of sequences. In this case the answer is a score value, not a simple yes or no as with regular expressions. Consensus sequences are very useful in searching for long and loosely defined patterns. On the other hand, short patterns, like the ones in the previous example, can be better searched with regular expressions.
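The expansion of one regular expression into several consensus sequences can be sketched in a few lines of Python. The list-of-alternatives representation used here (one string of allowed letters per position, with "X" for any residue) is an assumption made for this example.

```python
from itertools import product


def expand_consensus(positions):
    """List every consensus sequence obtained by picking one letter per position."""
    return ["".join(choice) for choice in product(*positions)]


# [RK]-x(2)-[DE]-x(3)-Y, written as one string of alternatives per position.
site = ["RK", "X", "X", "DE", "X", "X", "X", "Y"]
for consensus in expand_consensus(site):
    print(consensus)
# RXXDXXXY
# RXXEXXXY
# KXXDXXXY
# KXXEXXXY
```

The number of consensus sequences is the product of the number of alternatives at each position, so patterns with many ambiguous positions quickly become impractical to write out this way.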

Temple Smith and Randy Smith use an extended terminology for consensus sequences that allows the use of variable residues and gaps. Their PIMA program automatically generates "complete" consensus patterns for predefined groups of sequences. They constructed a comprehensive consensus sequence library called PLSEARCH from all the known protein superfamilies, which can be used to locate regions of similarity in a newly determined sequence (see below).

Profiles as representations of similarity groups

Profiles are specific representations that incorporate the entire information of a multiple alignment. The simplest forms of profiles can be pictured as an averaged representation of all sequences in a multiple alignment. This averaged "phantom sequence" representation can then be used as a query in database search. The trick is simple. We have to remember that in alignments (or database searching) each amino acid type has a column in the replacement matrix (e.g. the Dayhoff matrix) which contains the 20 cost values for replacing it with any of the 20 amino acids. These values are independent of the position of an amino acid within the sequence. Using a multiple alignment, one can make a sequence-dependent replacement matrix, and this will be the "profile". This can be demonstrated on the simple example already shown above:

                       1   5    10
sequence 1             CPKICIGGWFA
sequence 2             CSGICKKAWFG

For position 1 we will use the C column of the replacement matrix. For position 2, we will use an average of the P and S columns. For position 3, we will use the average of the K and G columns, etc. Positions 9 and 10 are conserved, so there we will use the original W and F columns of the replacement matrix, respectively. By doing this transformation, we will end up with a new matrix, which will have 20 rows (the original row number of the matrix) but 11 columns, i.e. as many columns as the length of the alignment. This is a simple profile that can be used in database search programs with little modification. Naturally, one can use a variety of other, much more sophisticated methods to build profiles.
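The column-averaging procedure just described can be sketched as follows. To keep the example self-contained, a toy replacement matrix (1 for identity, 0 otherwise) is assumed in place of the Dayhoff matrix; the function names and the dictionary layout of the profile are likewise choices made here for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_replacement(a: str, b: str) -> float:
    """Stand-in for a real replacement matrix: 1 for identity, 0 otherwise."""
    return 1.0 if a == b else 0.0


def build_profile(aligned_sequences, replacement=toy_replacement):
    """Average the replacement-matrix columns of the residues at each position.

    Returns a dict with 20 rows (one per amino acid), each a list with one
    value per alignment position.
    """
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("aligned sequences must have the same length")
    profile = {aa: [] for aa in AMINO_ACIDS}
    for column in zip(*aligned_sequences):
        for aa in AMINO_ACIDS:
            average = sum(replacement(aa, res) for res in column) / len(column)
            profile[aa].append(average)
    return profile


def score_against_profile(sequence, profile):
    """Ungapped score of a sequence of profile length against the profile."""
    return sum(profile[res][i] for i, res in enumerate(sequence))


profile = build_profile(["CPKICIGGWFA", "CSGICKKAWFG"])
print(len(profile), len(profile["C"]))                 # 20 11
print(profile["C"][0], profile["P"][1])                # 1.0 0.5
print(score_against_profile("CPKICIGGWFA", profile))   # 8.0
```

With the toy matrix, conserved positions contribute 1 and the two-residue positions contribute 0.5 each, so sequence 1 scores 5 + 6 x 0.5 = 8 against the profile. A real profile would use a full replacement matrix and would also need a treatment of gaps.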

The bias of patterns and profiles

Patterns and profiles are average representations, and all averages are biased. In our case this means that the pattern will faithfully reflect the properties of the similarity group we are working with, and it can be severely biased if the group is not properly selected. For example, fibronectin type 3 repeats (FN3-repeats) have two major types, the one occurring in fibronectins (FN3A) and the one occurring in cell adhesion molecules (FN3B). FN3A repeats were known much earlier and in much greater numbers. FN3B repeats were discovered later. At a given point in time, the database contained 120 FN3A and 10 FN3B repeats. If one built separate patterns, the FN3A pattern recognized only a few members of the FN3B group (and vice versa, the FN3B pattern recognized only some of the FN3As). If one combined all 130 sequences into one group for building a pattern, the resulting pattern mainly reflected the properties of FN3A, i.e. the majority group: it recognized the FN3A group much better, and hardly recognized the FN3B group. To this problem there is no theoretical solution. One practical solution is to take an equal number of sequences from each of the "representative groups". In this case one could keep all FN3Bs and add 10 "well selected" members of the FN3A family. In fact, such a strategy quite frequently allows one to find new domain examples in the database.
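The effect of group size on an averaged representation can be illustrated with a single alignment column. The residues used below (W for the majority type, Y for the minority type) are invented for this example and do not describe real FN3 repeats; only the group sizes, 120 and 10, follow the text.

```python
from collections import Counter


def column_frequencies(residues):
    """Relative frequency of each residue in one alignment column."""
    counts = Counter(residues)
    total = len(residues)
    return {aa: n / total for aa, n in counts.items()}


# Hypothetical column: the majority group has W here, the minority group has Y.
pooled = ["W"] * 120 + ["Y"] * 10      # all 130 sequences together
balanced = ["W"] * 10 + ["Y"] * 10     # 10 selected majority members + all minority

print(column_frequencies(pooled))      # Y contributes 10/130, under 8%
print(column_frequencies(balanced))    # {'W': 0.5, 'Y': 0.5}
```

In the pooled column the minority residue is almost invisible, so a pattern built from it will barely recognize the minority group; taking equal numbers from each group restores its weight.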

There are a number of problems with selecting a good set, however: i) The collection contains repeats of the same protein, and these are sometimes more closely related to each other than to repeats of other families. So it might be better to select one repeat from every protein. ii) Sometimes the serial order of the repeats is important. For example, repeat 1 in each protein is highly homologous among the proteins, repeat 2 is not, etc. So it might make sense to select one of each group.

Collections of similarity groups: domain databases

A strategy to avoid the bias of patterns is the use of a domain library, such as SBASE or PRODOM. Instead of consensus representations, SBASE contains several copies of known functional domains, so even homologies to individual members of a homology group can be picked up by simple database search. Searching this domain library with a program like BLAST or FASTA can give you direct information on functional homologies to segments of your protein sequence (see below). With this approach, the similarity group itself is used as a diagnostic tool.

There are two ways to build domain databases. SBASE is a collection of domains annotated in sequence databases. The domain boundaries are man-made and correspond to some established function or structure. (In fact SBASE contains functional, structural and cellular topology-directed domains as well as simple repeats found in various proteins.) The advantage of this approach is that it gives a clear indication of the information known about a certain domain type. PRODOM is a more recent database based on the Swiss-Prot database. It is a comprehensive collection of regions similar to each other, prepared by aligning each entry in Swiss-Prot with every other entry. This is a machine-based approach that usually gives similarity regions that do not coincide with the known boundaries of domains, modules etc. Since it is machine-made, it can also contain quite novel similarity groups.