 |
Let's say we obtained the highest similarity score s by comparing
a query sequence with one member of a sequence database. To determine the
mathematical significance of this score, we have to calculate the similarity
score between the query and every member of the database, and then determine
the average and the standard deviation of these scores.
The measure of mathematical significance is the so-called Student t
value, which allows a single value to be compared with a database mean value
calculated from n entries:
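
The formula itself is not preserved in this text. A plausible form, consistent with the description that follows (distance from the average in units of standard deviation), is sketched here. The symbols \bar{x} (mean of the n database scores) and \sigma (their standard deviation) are assumed notation, not taken from the original:

```latex
t = \frac{s - \bar{x}}{\sigma},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2}
```

Some formulations divide additionally by \sqrt{1 + 1/n}; for a database with many entries this factor is close to 1.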

This quantity expresses the distance of a score from the
average, in units of standard deviation. The higher the value, the higher
the significance. The significance is characterized by a value taken from
the Student table for a given t and a given sample size (the number
of database members). The table contains probability values that are
referred to as "significance levels". A value of 0.005 (often written
as p<0.005) means that the probability of finding score
s by chance is smaller than 0.5%. "Unique" sequences may have
much smaller probabilities, and the probabilities are automatically calculated
by most database searching programs. It is worth mentioning that the Student
test assumes that the distribution of scores is normal (Gaussian), which
is only approximately true. The test was developed by an Englishman, William
Gosset, who published under the pseudonym Student in the early 20th century.
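
The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular search program: the function names and the score list are hypothetical, the top score is kept separate from the background scores for simplicity, and the tail probability is approximated with the standard normal distribution (which the Student distribution approaches when the database is large) instead of being looked up in a Student table.

```python
import math
from statistics import mean, stdev


def t_value(top_score, database_scores):
    """Distance of top_score from the database mean, in standard deviations."""
    avg = mean(database_scores)
    sd = stdev(database_scores)  # sample standard deviation (n - 1 denominator)
    return (top_score - avg) / sd


def approx_p_value(t):
    """One-sided tail probability of the standard normal distribution.

    Assumption: used here as a large-n stand-in for the Student table.
    """
    return 0.5 * math.erfc(t / math.sqrt(2.0))


# Hypothetical similarity scores of the query against database members.
background_scores = [20, 22, 25, 19, 21, 23, 24, 20, 22, 24]
top_score = 28  # hypothetical best hit

t = t_value(top_score, background_scores)
p = approx_p_value(t)
print(f"t = {t:.2f}, approximate p = {p:.5f}")
```

With so few scores the normal approximation is crude; for a small sample, a proper Student table (or a statistics library) with the matching sample size would be the better choice.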
|