FINDING AND ESTABLISHING PATTERNS IN A NEW SEQUENCE

Let's suppose you are interested in finding known patterns or known functional similarities in the following sequence region:

LHKGIMVNGVDEATILDLLTKKYNAQRHHLKAVYIQETGEPLDETLKKALTGHIQELLLAM

A) FIND KNOWN PATTERNS AND FUNCTIONAL HOMOLOGIES

This is relatively easy. You should first submit your sequence to one of the pattern database servers that contain collections of known patterns in various forms. You do not need to worry about the formats, since you can submit your sequence "as is", and the search, including the format, is an "internal affair" of the server.

Use the following servers: (PROSITE, PRINTS, BLOCKS, ISREC PROFILES)

Next you may try protein domain collections which contain either annotated functional domains, like the SBASE protein library, or machine-annotated and automatically generated similarity groups, like the PRODOM database.

Use the following servers: (SBASE, PRODOM)

Independently of the result, try to:

B) FIND NEW PATTERNS IN A SEQUENCE

This part is more complex, so we give concrete examples. Let's suppose you are still interested in finding hitherto unknown patterns in the following sequence region:

LHKGIMVNGVDEATILDLLTKKYNAQRHHLKAVYIQETGEPLDETLKKALTGHIQELLLAM

The steps are as follows:

1) DATABASE SEARCH

You can start with any database search and manually build up the starting list of similarity regions from the search results. The best solution is to search a daily updated database. Most of the daily updated searches use BLAST, and it takes some patience to build up the similarity regions from the many segments found in BLAST results. Full Smith-Waterman searches are preferable; however, they are seldom installed on daily updated databases.

For simplicity's sake we give here an example based on a FASTA search. FASTA searches are not optimal; however, they are reasonably fast and, more importantly, each hit contains a single alignment region. So it is a good compromise. The search is available, for example, at:

http://www.ebi.ac.uk/searches/fasta.html

You can also use the GCG FASTA program. If you choose to do that, you will need to prepare an input file in the FASTA format.

If you use the Swiss-Prot database either with GCG or with a server, you will receive a search output, with alignments like this

From the search output you will have to extract the alignment regions, according to the empirical rules mentioned before during this course. You need to select sequences that are "divergent enough" so as to make a good pattern.

Let's suppose you extract the following group of homologous sequence regions from a database search:

>1 (your query)
LHKGIMVNGVDEATILDLLTKKYNAQRHHLKAVYIQETGEPLDETLKKALTGHIQELLLAM
>2
LHAAMAGIGTEEATLVEILCTKTNEEMAQIVAVYEERYQRPLAEQMCSETSGFFRRLLTLI
>3
LHAAMKGLGTDENALIDILCTQSNAQIHAIKAAFKLLYKEDLEKEIISETSGNFQRLLVS
>4
LRACMKGHGTDEDTLIEILASKKNNKEIREACRYYKEVLKRDLTQDIISDTSGDFQKALVSL
>5
IHSACAGAGTNENTIIEILVTKNVQMEYIKQIFKNKHGKSLKDRLESEASGDFKKLLEKL
>6
YEAGELKWGTDEAQFIYILGNKSKQHLRLVFDEYLKTTGKPIEASIRGELSGDFEKLMLAV

These sequences are in the FASTA format.
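
If you prefer to handle such a file with a script rather than by hand, the following is a minimal sketch of reading FASTA-formatted text in Python. The function name parse_fasta is a hypothetical helper (it is not part of GCG or of any server), and only the first two records of the group above are used to keep the example short.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            records[header] += line  # a sequence may span several lines
    return records


# First two records of the group above, copied verbatim.
example = """\
>1 (your query)
LHKGIMVNGVDEATILDLLTKKYNAQRHHLKAVYIQETGEPLDETLKKALTGHIQELLLAM
>2
LHAAMAGIGTEEATLVEILCTKTNEEMAQIVAVYEERYQRPLAEQMCSETSGFFRRLLTLI
"""

records = parse_fasta(example)
for header, sequence in records.items():
    print(header, len(sequence))
```

Each record starts with a ">" line carrying the name, followed by one or more lines of sequence; the sketch simply concatenates the sequence lines that follow each header.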

2) MULTIPLE ALIGNMENT

You submit the above file to the BCM multiple alignment server

http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html

You have several algorithms to choose from, all of which give you multiple alignments. You can use these to build a pattern heuristically. If you do not know how to build a pattern, use the PIMA algorithm, because it gives you a pattern in addition to the multiple alignment. With complex sequences this is a great advantage.

For the file above you will get the following output within the WWW session

3) BUILDING A PROSITE PATTERN FROM A PIMA MULTIPLE ALIGNMENT OUTPUT

The pattern is in a specific form, but you can use the alphabet defined here, in order to generate a pattern in a more general format, e.g. the PROSITE or regular expression format.

First extract the pattern, use pat-SB from the output:

1 pat-SB ----GXmEXXf cXcLXrgggrXoggggXXfgggdXXXXXpX cXXrcXXXXiGXfpp Xb---

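Since this pattern is very long, we use only part of it as an example: the stretch that runs from the first conserved G to the first small letter d (GXmEXXf cXcLXrgggrXoggggXXfgggd).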

You see there are very few conserved positions, and the ones you find are not very "useful" (structurally important or rare) amino acids. So you will have to incorporate as many as possible of the not completely aligned positions. These are denoted by small letters (note that "g" denotes a gap!).
You can either construct a PROSITE-like pattern manually from the multiple alignment, or use the pattern given by the program. You can do the latter as shown below, using the pattern alphabet of the PIMA program, given here in tree form:

                                                            -2  "x"

          ________________ X __________________              0  " "
         /          /           \              \
     __ f __       /       ______r _______      \            1  "."
    /  /    \     /       /   /     \     \      \
   /  c      \   e       /   m       p     \    _ j __       2  ":"
  /  / \      \ / \     /   / \     / \     \  /   \  \
 /  a    b     d   \   /   l   k   o    n     i     h  \     3 "!"
/  / \  / \   /|\   \ /   / \ / \ / \   /\   / \   / \  \
C  I V  L M  F W Y   H   N   D   E   Q  K R  S T   A G   P   5 "|"

The syntax of PROSITE is shown here. If you select the 'regular expression' format, it uses the full syntax of regular expressions defined in Perl. A reference list can be found at:

http://www.iihe.ac.be/lll/Recu/various/perlregexp.html

You build the pattern starting e.g. from the first conserved residue. You can build the pattern heuristically or by following the procedure outlined here. First copy the conserved residues and the X-es, and then replace each small letter, using copy/paste, with the group corresponding to it in the alphabet above. For example, "m" covers any of "NDE". So you copy "N D E" from the diagram above, delete the spaces and put it into square brackets. This is essentially a word-processing exercise. In detail:

Take the pattern and delete the spaces (they do not have a meaning here):

GXmEXXfcXcLXrgggrXoggggXXfgggd

Put all small letters except "g" into square brackets

GX[m]EXX[f][c]X[c]LX[r]ggg[r]X[o]ggggXX[f]ggg[d]

Replace the small letters with the groups found in the "tree" above. Since some letters (such as "c") occur more than once in this case, it is useful to compile the groups first. You have these groups in the pattern:

c --> IVLM
f --> CIVLMFWY
m --> NDE
d --> FWY
o --> EQ

You can do the replacements with the "Replace" option of a word processor. Do the replacements in a case-sensitive way, e.g. using the "Match case" option of Microsoft Word. In this way you avoid replacing the capital letters.
You can make all replacements from the short list above, or you may choose to use X for the larger groups. This makes the pattern simpler, even though it may become less selective. In this case we will use X for r and f. We get the following pattern:

GX[NDE]EXXX[IVLM]X[IVLM]LXXgggXX[EQ]ggggXXXggg[FWY]

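Replace the stretches that contain gaps with ranges of X-s. Each "g" stands for zero or one residue, so the number of X-s in a stretch gives the minimum length, and the number of X-s plus the number of g-s gives the maximum. E.g.: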

XXgggXX --> X(4,7)
ggggXXXggg --> X(3,10)

At the same time you may replace every run of consecutive X-s with a number, e.g. XXXX --> X(4).
You get this long pattern:

GX[NDE]EX(3)[IVLM]X[IVLM]LX(4,7)[EQ]X(3,10)[FWY]

Finally, you add the hyphens "-" to make a correct PROSITE format:

G-X-[NDE]-E-X(3)-[IVLM]-X-[IVLM]-L-X(4,7)-[EQ]-X(3,10)-[FWY]
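
The word-processing steps above can also be sketched as a small Python function. This is only an illustration under stated assumptions: the name pima_to_prosite is hypothetical (it is not part of the PIMA program), the group table contains only the four groups listed above, "r" and "f" are relaxed to X as chosen in the text, and each "g" is treated as zero or one residue.

```python
# Groups copied from the short list above (c, m, d, o).
GROUPS = {"c": "IVLM", "m": "NDE", "d": "FWY", "o": "EQ"}
# Classes that the text chooses to relax to X.
RELAXED = set("fr")


def pima_to_prosite(pima_pattern):
    """Convert a PIMA pattern fragment into a PROSITE-style pattern."""
    pattern = pima_pattern.replace(" ", "")  # spaces carry no meaning
    wild = set("Xg") | RELAXED
    elements = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch in wild:
            # Collapse a run of X / relaxed classes / gaps into one element.
            fixed = gaps = 0
            while i < len(pattern) and pattern[i] in wild:
                if pattern[i] == "g":
                    gaps += 1
                else:
                    fixed += 1
                i += 1
            if gaps:
                elements.append(f"X({fixed},{fixed + gaps})")
            elif fixed == 1:
                elements.append("X")
            else:
                elements.append(f"X({fixed})")
            continue
        if ch in GROUPS:
            elements.append(f"[{GROUPS[ch]}]")
        elif ch.isupper():
            elements.append(ch)  # conserved residue, copied as is
        else:
            raise ValueError(f"no group defined for {ch!r}")
        i += 1
    return "-".join(elements)


result = pima_to_prosite("GXmEXXf cXcLXrgggrXoggggXXfgggd")
print(result)
```

For the fragment used in this exercise the function reproduces the hand-built pattern. Doing the conversion by hand at least once is still worthwhile, because it forces you to decide which groups to keep and which to relax.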

Note: You can also use the regular expression form of patterns.

Note: BE CAREFUL. IF YOU MAKE A MISTAKE IN THE SYNTAX, THE RESULT WILL BE EMPTY, WITH NO EXPLANATION! E.g. a single missing "-" may cause your search to fail.
(While constructing this exercise I made 5 failed searches because of missing "-"-s.)
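
One way to catch such syntax slips before submitting is to translate the pattern into a regular expression locally and try it on your own query. The sketch below is a hedged illustration: prosite_to_regex is a hypothetical helper, it understands only the subset of the PROSITE syntax used in this exercise (single residues, bracketed groups, X, and (n) or (n,m) repeats), and it is not the code used by any server.

```python
import re

# One pattern element: a residue or [group], optionally followed by (n) or (n,m).
ELEMENT = re.compile(r"(\[[A-Z]+\]|[A-Z])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(prosite):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in prosite.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"malformed element {element!r} (missing '-'?)")
        residue, low, high = m.groups()
        part = "." if residue == "X" else residue
        if high is not None:
            part += "{%s,%s}" % (low, high)
        elif low is not None:
            part += "{%s}" % low
        parts.append(part)
    return "".join(parts)


pattern = "G-X-[NDE]-E-X(3)-[IVLM]-X-[IVLM]-L-X(4,7)-[EQ]-X(3,10)-[FWY]"
query = "LHKGIMVNGVDEATILDLLTKKYNAQRHHLKAVYIQETGEPLDETLKKALTGHIQELLLAM"

regex = prosite_to_regex(pattern)
hit = re.search(regex, query)
print(regex)
print(hit.start() + 1, hit.group())  # 1-based start and matched region
```

Unlike the silent empty result described in the note, a pattern with a missing hyphen, such as "G-X-[NDE]E", makes this sketch raise a ValueError naming the offending element.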

4) PATTERN SEARCH; VERIFICATION AND REFINEMENT OF THE PATTERN

You submit the above pattern to the ISREC pattern search facility, by copy/paste.

http://ulrec3.unil.ch/software/PATFND_mailform.html

This server gives an answer by e-mail, so you will have to give your e-mail address. Select the motifs option and use the Swiss-Prot database. (With your real patterns you should first scan the Swiss-Prot database and then proceed to the nonredundant database.)

You will receive the following answer by e-mail:

>ANX1_BOVIN:59
GVDEATIIEILTKRNNAQRQQIKAAY
>ANX1_CAVCU:59
GVDEATIIDILTKRNNAQRQQIKAAY
>ANX1_CAVCU:131
GTDEDTLIEILVSRKNREIKEINRVY
>ANX1_COLLI:126
GTDEDTLIEILASRNNKEIREACRYY
>ANX1_GEOCY:59
GVDEATILDLLTKRYNAQRHHLKAVY
>ANX1_GEOCY:131
GTDEETLIEILWTRSNQQIREITSVY
>ANX1_HUMAN:58
GVDEATIIDILTKRNNAQRQQIKAAY
>ANX1_HUMAN:130
GTDEDTLIEILASRTNKEIRDINRVY
>ANX1_MOUSE:58
GVDEATIIDILTKRTNAQRQQIKAAY
>ANX1_MOUSE:130
GTDEDTLIEILTTRSNEQIREINRVY
>ANX1_RAT:58
GVDEATIIDILTKRTNAQRQQIKAAY
>ANX1_RAT:130
....

You have a great number of sequences. This is suspicious... But the solution is simple! If you look up the sequences on the Swiss-Prot server, you will find that they all contain ANNEXIN-REPEAT motifs. In fact, the example was built from ANNEXIN-REPEATs.
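
Before looking up every hit, a quick tally of the identifiers already hints at the answer. The sketch below is an illustration only: it uses just the header lines shown in the answer above, relies on the Swiss-Prot convention that entry names have the form PROTEIN_SPECIES, and assumes that the number after the colon is the position of the match in the entry.

```python
from collections import Counter

# Header lines copied from the e-mail answer shown above.
HEADERS = """\
>ANX1_BOVIN:59
>ANX1_CAVCU:59
>ANX1_CAVCU:131
>ANX1_COLLI:126
>ANX1_GEOCY:59
>ANX1_GEOCY:131
>ANX1_HUMAN:58
>ANX1_HUMAN:130
>ANX1_MOUSE:58
>ANX1_MOUSE:130
>ANX1_RAT:58
>ANX1_RAT:130
"""

hits = []
for line in HEADERS.splitlines():
    if line.startswith(">"):
        entry, position = line[1:].rsplit(":", 1)
        protein, species = entry.split("_", 1)
        hits.append((protein, species, int(position)))

# How many hits per protein name, and how many per species?
proteins = Counter(protein for protein, _, _ in hits)
hits_per_species = Counter(species for _, species, _ in hits)

print(proteins)
print(hits_per_species)
```

All headers in this excerpt share one protein name, and most entries are hit more than once, which is what you would expect from a motif that is repeated within a single protein family.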

You can now try to build an extended pattern and scan the nonredundant database with it. Here you can discover new ANNEXIN-REPEATs in sequences not yet in Swiss-Prot. You can identify these by looking for the non-Swiss-Prot identifiers in the output!

For example, a pattern variant extended at the C-terminal end of the previous pattern looks like this:

G-X-[NDE]-E-X(3)-[IVLM]-X-[IVLM]-L-X(4,7)-[EQ]-X(3,10)-[FWY]-X(5)-[EQKR]

Here we used "X" for the group "f". Submitting it to Swiss-Prot we get identical results (not shown). This means that the first pattern was already good enough...
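
A small local check can complement the database scan (it does not replace it): compare the two patterns on the six sequence regions collected in step 1. The regular expressions below are hand translations of the two PROSITE patterns and are an assumption of this sketch, as are the names BASE and EXTENDED.

```python
import re

# Hand-translated regular expressions for the two patterns (assumption).
BASE = re.compile(r"G.[NDE]E.{3}[IVLM].[IVLM]L.{4,7}[EQ].{3,10}[FWY]")
EXTENDED = re.compile(BASE.pattern + r".{5}[EQKR]")

# The six homologous regions from step 1, copied verbatim.
SEQUENCES = {
    "1": "LHKGIMVNGVDEATILDLLTKKYNAQRHHLKAVYIQETGEPLDETLKKALTGHIQELLLAM",
    "2": "LHAAMAGIGTEEATLVEILCTKTNEEMAQIVAVYEERYQRPLAEQMCSETSGFFRRLLTLI",
    "3": "LHAAMKGLGTDENALIDILCTQSNAQIHAIKAAFKLLYKEDLEKEIISETSGNFQRLLVS",
    "4": "LRACMKGHGTDEDTLIEILASKKNNKEIREACRYYKEVLKRDLTQDIISDTSGDFQKALVSL",
    "5": "IHSACAGAGTNENTIIEILVTKNVQMEYIKQIFKNKHGKSLKDRLESEASGDFKKLLEKL",
    "6": "YEAGELKWGTDEAQFIYILGNKSKQHLRLVFDEYLKTTGKPIEASIRGELSGDFEKLMLAV",
}

starts = {}
for name, sequence in SEQUENCES.items():
    base = BASE.search(sequence)
    extended = EXTENDED.search(sequence)
    # Record 1-based start positions, or None when a pattern does not match.
    starts[name] = (
        base.start() + 1 if base else None,
        extended.start() + 1 if extended else None,
    )
    print(name, starts[name])

query_extended = EXTENDED.search(SEQUENCES["1"]).group()
print(query_extended)
```

Both patterns find the same region in each of the six sequences, so the extension does not lose any member of the starting set. Whether it excludes unrelated hits can only be judged from the database scan itself.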

So we can now scan the updated nonredundant database for new annexins. We do this on the same server, using the nonredundant option.

We use the sequences option, since we would like to collect the new sequences...