Let's suppose we compare 10 sequences in all possible pairings and
in three cases we find a) a high similarity score and b) a similar pattern
of alignment, e.g.:

C . . . H . . . C

Is this pattern important?

The first thing to do is a multiple alignment using the three candidate
sequences. In this multiple alignment we may discover new sequence identities
that we may have overlooked previously, e.g. we discover that the shared
pattern is

C . . . H . . . C . . C

(Some of you may have noticed that this example is the pattern in zinc-fingers.)
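Searching a small collection of sequences for such a pattern can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each dot is taken to mean exactly one arbitrary residue (the dots may equally stand for variable-length spacers), and the function name `find_pattern` and the two example sequences are hypothetical, not taken from the text.

```python
import re

# Assumption: every dot in "C . . . H . . . C . . C" is exactly one residue.
# The lookahead (?=...) lets overlapping occurrences be reported as well.
PATTERN = re.compile(r"(?=C.{3}H.{3}C.{2}C)")


def find_pattern(sequence):
    """Return the 0-based start positions of all matches in one sequence."""
    return [match.start() for match in PATTERN.finditer(sequence)]


# Invented toy sequences, used only to exercise the function.
sequences = {
    "candidate": "MKCAAAHGGGCTTCLL",
    "other": "ACDEFGHIKLMNPQ",
}

for name, seq in sequences.items():
    positions = find_pattern(seq)
    print(name, "matches at", positions if positions else "no position")
```

The same loop, run over all sequences of a database, shows in which entries the pattern is present and in which it is absent.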
But is this pattern important?

First we have to look at the pattern using the rules described above
(frequent residues, repetitive sequences, structurally important residues).
This example is seemingly OK: we have cysteines (perhaps structurally important)
plus a histidine, which is not an extremely frequent residue.

Second, we have to ascertain whether this pattern is absent from all
other sequences. So we test our database of 10 and find that it is in fact
absent from the other sequences.

Third, and most important, we have to find out whether or not this pattern
can have a biological meaning at all. If all three sequences have the
same function, e.g. DNA-binding proteins, one can think that the pattern
has to do with the function. But let's not be overconfident. Leucine-zippers
are parts of DNA-binding proteins, but they mediate dimerization and are
not directly involved in binding DNA. So, it is not possible to firmly
establish a biological function in most cases.

The solution can be "statistical": if we have a "sufficiently
big" number of examples in which a biological function is associated
with a certain sequence pattern, we can feel encouraged. It is equally
important that the pattern should not be found in those sequences which
do not carry the same biological function. In a different case, we may
also have strong biological clues. For example, we may know that three protein
sequences are involved in the same biological function in different organisms.
In this case, the group is given in advance and we are looking for a sequence
pattern that may be used to characterize the group and, moreover, may be
responsible for the biological function.

Let's say we compare our above pattern to two groups of sequences, i)
DNA-binding proteins and ii) non-DNA-binding proteins. We find that the
DNA-binding proteins always have a score greater than 10 and the non-DNA-binding
proteins have scores less than 5. We are now ready with the rule: any sequence
producing a similarity score greater than, say, 7 in comparison with the
pattern may belong to the group of DNA-binding proteins. In other words,
we define a threshold value. This rule may work, but in real-life situations
we usually have complications, like:

i) DNA-binding proteins that score below 7. We call these "false
negatives" because they scored negative in the test but are in fact
positive.

ii) Non-DNA-binding proteins that score above 7. We call these "false
positives" since they scored positive but are in fact negative.

In fact, for a "perfect pattern" we require that it should
occur only in the "true positive" sequences (that really carry
the function in question), and in none of the "true negative"
sequences (which do not carry the function). Also, we do not want "false
positives" and "false negatives".

Real-life situations are unfortunately even more complicated. Usually
we have a small group (a "test group") of sequences that
we suspect may carry a common function. We establish a pattern and
then use some rudimentary procedure to calculate a "similarity score"
for that pattern. We use the entire database as the other group. Using
Student's principle for comparing two groups (each characterized by an
average and a standard deviation), one can calculate the statistical
significance of the difference:
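
One common form of such a statistic, given here as an assumption rather than as the original expression, is the unequal-variance (Welch) variant of Student's t. The symbols are introduced only for illustration: m_t, s_t and n_t are the average, standard deviation and number of scores in the test group, and m_d, s_d and n_d are the same quantities for the database.

$$
t = \frac{m_t - m_d}{\sqrt{\dfrac{s_t^{2}}{n_t} + \dfrac{s_d^{2}}{n_d}}}
$$
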
If this value is big, the separation is "mathematically significant",
so we can trust the pattern more. It is easier to depict graphically
the situation, as shown in figure |
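
The threshold rule and the two-group comparison described above can also be tried out on a handful of numbers. The following is a minimal sketch, not the procedure used in the text: the scores are invented for illustration, the function names `confusion_counts` and `welch_t` are hypothetical, and the t statistic uses the unequal-variance (Welch) form, which is an assumption about the intended formula.

```python
from math import sqrt
from statistics import mean, stdev


def confusion_counts(positive_scores, negative_scores, threshold):
    """Count outcomes of the rule 'score > threshold means positive'."""
    tp = sum(s > threshold for s in positive_scores)   # true positives
    fn = len(positive_scores) - tp                     # false negatives
    fp = sum(s > threshold for s in negative_scores)   # false positives
    tn = len(negative_scores) - fp                     # true negatives
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn}


def welch_t(group_a, group_b):
    """Two-group t statistic from averages and sample standard deviations."""
    var_a = stdev(group_a) ** 2 / len(group_a)
    var_b = stdev(group_b) ** 2 / len(group_b)
    return (mean(group_a) - mean(group_b)) / sqrt(var_a + var_b)


# Invented example scores (not data from the text).
dna_binding = [12.0, 11.0, 13.0, 6.0]          # one protein scores below 7
non_dna_binding = [2.0, 3.0, 4.0, 8.0, 1.0]    # one protein scores above 7

counts = confusion_counts(dna_binding, non_dna_binding, threshold=7)
t_value = welch_t(dna_binding, non_dna_binding)

print(counts)
print("t =", round(t_value, 2))
```

Moving the threshold up or down trades false positives against false negatives, while the t value summarizes how well the two score distributions are separated overall.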