Let's suppose you are interested in finding known patterns or known
functional similarities in the following sequence region:
LHKGIMVNGVDEATILDLLTKKYNAQRHHLKAVYIQETGEPLDETLKKALTGHIQELLLAM
This is relatively easy. You should first submit your sequence to one
of the pattern database servers, which hold collections of known patterns
in various forms. You do not need to worry about the formats: you
can submit your sequence "as is", and the search, including any
format handling, is an "internal affair" of the server.
Use the following servers: PROSITE, PRINTS, BLOCKS, ISREC PROFILES.
Next you may try protein domain collections, which contain either annotated
functional domains, like the SBASE protein library, or machine-annotated
and automatically generated similarity groups, like the PRODOM database.
Use the following servers: SBASE, PRODOM.
Independently from the result, try to
This part is more complex, so we give concrete examples. Let's suppose
you are still interested in finding hitherto unknown patterns in the following
sequence region:
LHKGIMVNGVDEATILDLLTKKYNAQRHHLKAVYIQETGEPLDETLKKALTGHIQELLLAM
The steps are as follows:
You can start with any database search and manually build up the starting
list of similarity regions from the search results. The best solution is
to search a daily updated database. Most of the daily updated searches use
BLAST, and it takes some patience to build up the similarity regions from
the many segments found in BLAST results. Full Smith-Waterman searches
are preferable; however, they are seldom installed on daily updated databases.
For simplicity's sake we give here an example based on a FASTA search.
FASTA searches are not optimal; however, they are reasonably fast and, more
importantly, each result contains a single alignment region. So it is a good
compromise.
The search is available for example at:
You can also use the GCG FASTA program. If you choose to do that, you
will need to prepare an input file in the FASTA format.
If you use the Swiss-Prot database either with GCG
or with a server, you will receive a search output, with alignments like
this
From the search output you will have to extract the alignment regions,
according to the empirical rules mentioned earlier in this course. You
need to select sequences that are "divergent enough" to
make a good pattern.
Let's suppose you extract the following group of homologous sequence
regions from a database search:
>1 (your query)
LHKGIMVNGVDEATILDLLTKKYNAQRHHLKAVYIQETGEPLDETLKKALTGHIQELLLAM
>2
LHAAMAGIGTEEATLVEILCTKTNEEMAQIVAVYEERYQRPLAEQMCSETSGFFRRLLTLI
>3
LHAAMKGLGTDENALIDILCTQSNAQIHAIKAAFKLLYKEDLEKEIISETSGNFQRLLVS
>4
LRACMKGHGTDEDTLIEILASKKNNKEIREACRYYKEVLKRDLTQDIISDTSGDFQKALVSL
>5
IHSACAGAGTNENTIIEILVTKNVQMEYIKQIFKNKHGKSLKDRLESEASGDFKKLLEKL
>6
YEAGELKWGTDEAQFIYILGNKSKQHLRLVFDEYLKTTGKPIEASIRGELSGDFEKLMLAV
These sequences are in the FASTA format.
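Before you submit such a file, it can help to check that it really parses as FASTA (one header line starting with ">" per sequence). The following is only a minimal sketch of such a check; the function name parse_fasta is our own choice for illustration, and only the first two records of the group are included to keep it short.

```python
# Minimal FASTA reader: a header line starts with ">", and the lines
# that follow (up to the next header) hold the sequence.
def parse_fasta(text):
    records = []                      # list of (header, sequence) tuples
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                  # skip empty lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)       # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# First two records of the group above.
example = """\
>1 (your query)
LHKGIMVNGVDEATILDLLTKKYNAQRHHLKAVYIQETGEPLDETLKKALTGHIQELLLAM
>2
LHAAMAGIGTEEATLVEILCTKTNEEMAQIVAVYEERYQRPLAEQMCSETSGFFRRLLTLI
"""

records = parse_fasta(example)
for header, seq in records:
    print(header, len(seq))
```

If a sequence ends up attached to the wrong header, or a record is missing, the input file is not in the format the server expects.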
You submit the above file to the BCM multiple alignment server
http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html
You can choose among several algorithms, all of which give you multiple
alignments. You can use these to build a pattern heuristically. If you do
not know how to build a pattern, use the PIMA algorithm, because it gives
you a pattern in addition to the multiple alignment. With complex sequences
this is a great advantage.
For the file above you will get the following output within the WWW
session
The pattern is in a specific form, but you can use the alphabet defined
here, in order to generate a pattern in a more general format, e.g. the
PROSITE or regular expression format.
First extract the pattern, use pat-SB from the output:
1 pat-SB ----GXmEXXf cXcLXrgggrXoggggXXfgggdXXXXXpX cXXrcXXXXiGXfpp Xb---
Since this pattern is very long, we use only a part of it as an example
(the part underlined in the original output, from the first G up to the d
that precedes XXXXXpX).
[Tree diagram of the amino acid class alphabet; the branch layout is not
preserved here. The rows of the tree, from the root down to the amino acids,
each followed by the number and symbol printed at the right of the diagram:]
(top of diagram)                            -2 "x"
X                                            0 " "
f r                                          1 "."
c e m p j                                    2 ":"
a b d l k o n i h                            3 "!"
C I V L M F W Y H N D E Q K R S T A G P      5 "|"
The syntax of PROSITE is shown here.
If you select the 'regular expression' format, it uses the full syntax of
regular expressions defined in Perl. A http reference list can be found under
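The part of the pattern used in this example:
GXmEXXfcXcLXrgggrXoggggXXfgggd
Put the lowercase class letters in square brackets (g is left as it is):
GX[m]EXX[f][c]X[c]LX[r]ggg[r]X[o]ggggXX[f]ggg[d]
Then replace each class letter with the amino acids it stands for:
c --> IVLM
f --> CIVLMFWY
m --> NDE
d --> FWY
o --> EQ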
You can do the replacements with the "Replace" option of a
word processor. Do the replacements in a case-sensitive way, e.g. using
the "Match case" option of Microsoft Word. In this way you avoid
replacing the capital letters.
You can make all replacements from the short list above, or may choose
to use X for the larger groups. This makes the pattern simpler even though
it may become less selective. In this case we will use X for r and f. We
get the following pattern:
GX[NDE]EXXX[IVLM]X[IVLM]LXXgggXX[EQ]ggggXXXggg[FWY]
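If you prefer a script to a word processor, the same case-sensitive replacement can be sketched in a few lines. This is only an illustration: the table holds the classes from the short list above plus the choice made in this example (X for r and f), and the function name expand_classes is our own.

```python
# Class letters kept as explicit groups (from the short list above),
# and the two large classes that this example replaces by X.
CLASSES = {
    "c": "[IVLM]",
    "m": "[NDE]",
    "d": "[FWY]",
    "o": "[EQ]",
    "r": "X",
    "f": "X",
}

def expand_classes(pattern):
    # The lookup is case sensitive: capital letters and "g" are not keys
    # of the table, so they pass through unchanged.
    return "".join(CLASSES.get(ch, ch) for ch in pattern)

print(expand_classes("GXmEXXfcXcLXrgggrXoggggXXfgggd"))
# GX[NDE]EXXX[IVLM]X[IVLM]LXXgggXX[EQ]ggggXXXggg[FWY]
```

Because every character is looked up exactly once, a letter inside an inserted group (for example the D in [NDE]) cannot be replaced again by a later step.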
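Next, replace each run of X and g characters with a range, counting every X
as a required position and every g as an optional one:
XXgggXX --> X(4,7)
ggggXXXggg --> X(3,10)
At the same time you may replace all series of X with numbers, e.g. XXXX becomes X(4).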
You get this long pattern:
GX[NDE]EX(3)[IVLM]X[IVLM]LX(4,7)[EQ]X(3,10)[FWY]
Finally, you add the hyphens "-" to make a correct PROSITE
format:
G-X-[NDE]-E-X(3)-[IVLM]-X-[IVLM]-L-X(4,7)-[EQ]-X(3,10)-[FWY]
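The last two steps (turning runs of X and g into ranges, then adding the hyphens) are easy to get wrong by hand, so here is a minimal sketch of how they could be scripted. The function name to_prosite is our own, and the sketch assumes the input contains only capital letters, bracketed groups, X and g, as in this example.

```python
import re

def to_prosite(pattern):
    # Split into bracketed groups, runs of X/g, and single residues.
    tokens = re.findall(r"\[[A-Z]+\]|[Xg]+|[A-Z]", pattern)
    if "".join(tokens) != pattern:
        raise ValueError("unexpected character in pattern: " + pattern)
    elements = []
    for tok in tokens:
        if tok[0] in "Xg":
            low = tok.count("X")      # X positions are required
            high = len(tok)           # g positions are optional
            if low == high == 1:
                elements.append("X")
            elif low == high:
                elements.append("X(%d)" % low)
            else:
                elements.append("X(%d,%d)" % (low, high))
        else:
            elements.append(tok)
    # PROSITE separates the pattern elements with hyphens.
    return "-".join(elements)

print(to_prosite("GX[NDE]EXXX[IVLM]X[IVLM]LXXgggXX[EQ]ggggXXXggg[FWY]"))
# G-X-[NDE]-E-X(3)-[IVLM]-X-[IVLM]-L-X(4,7)-[EQ]-X(3,10)-[FWY]
```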
Note: You can also use the regular expression form
of patterns.
Note: BE CAREFUL. IF YOU MAKE A MISTAKE IN THE
SYNTAX, THE RESULT WILL BE EMPTY, WITH NO EXPLANATIONS! E.g. a single missing
"-" may cause your search to fail...
(While constructing this exercise I made 5 mistaken searches because of
missing "-"-s.)
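Since the server does not explain syntax errors, you may want to check the pattern locally before you submit it. The sketch below converts the pattern to a regular expression and tries it on the query sequence. It is not the server's own parser: it covers only the subset of the PROSITE syntax used in this exercise (single residues, X, bracketed groups and repeat counts), and the function name prosite_to_regex is our own.

```python
import re

QUERY = "LHKGIMVNGVDEATILDLLTKKYNAQRHHLKAVYIQETGEPLDETLKKALTGHIQELLLAM"
PATTERN = "G-X-[NDE]-E-X(3)-[IVLM]-X-[IVLM]-L-X(4,7)-[EQ]-X(3,10)-[FWY]"

# One pattern element: a residue or a bracketed group, optionally
# followed by a repeat count such as (3) or (4,7).
ELEMENT = re.compile(r"([A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(prosite):
    parts = []
    for element in prosite.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            # A missing hyphen, e.g. "X[NDE]", ends up here.
            raise ValueError(f"bad pattern element: {element!r}")
        residue, low, high = m.groups()
        part = "." if residue == "X" else residue
        if low is not None:
            part += "{" + low + ("," + high if high else "") + "}"
        parts.append(part)
    return "".join(parts)

regex = prosite_to_regex(PATTERN)
match = re.search(regex, QUERY)
print(regex)
print(match.group() if match else "no match")
# G.[NDE]E.{3}[IVLM].[IVLM]L.{4,7}[EQ].{3,10}[FWY]
# GVDEATILDLLTKKYNAQRHHLKAVY
```

A pattern with a missing hyphen, such as G-X[NDE]-E, raises an error here instead of silently returning nothing.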
You submit the above pattern to the ISREC pattern search facility, by
copy/paste.
This server gives an answer by e-mail, so you will have to give your
e-mail address. Select the motifs option and use the Swiss-Prot database.
(With your real patterns you should first scan the Swiss-Prot database
and then proceed to the nonredundant database.)
You will receive the following answer by e-mail:
>ANX1_BOVIN:59 GVDEATIIEILTKRNNAQRQQIKAAY
>ANX1_CAVCU:59 GVDEATIIDILTKRNNAQRQQIKAAY
>ANX1_CAVCU:131 GTDEDTLIEILVSRKNREIKEINRVY
>ANX1_COLLI:126 GTDEDTLIEILASRNNKEIREACRYY
>ANX1_GEOCY:59 GVDEATILDLLTKRYNAQRHHLKAVY
>ANX1_GEOCY:131 GTDEETLIEILWTRSNQQIREITSVY
>ANX1_HUMAN:58 GVDEATIIDILTKRNNAQRQQIKAAY
>ANX1_HUMAN:130 GTDEDTLIEILASRTNKEIRDINRVY
>ANX1_MOUSE:58 GVDEATIIDILTKRTNAQRQQIKAAY
>ANX1_MOUSE:130 GTDEDTLIEILTTRSNEQIREINRVY
>ANX1_RAT:58 GVDEATIIDILTKRTNAQRQQIKAAY
>ANX1_RAT:130 ....
You have a great number of sequences. This is suspicious... But the
solution is simple! If you look up the sequences on the Swiss-Prot server, you
can find out that all these sequences contain ANNEXIN-REPEAT motifs.
In fact, the example was made of ANNEXIN-REPEATs.
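With a long answer it helps to group the hits by sequence identifier, because a repeated motif shows up as several hits in the same entry. The following is a minimal sketch that assumes each hit has the form ">identifier:position segment", as in the answer above; only the first four hits are included, and reading the number as the position of the hit in the sequence is our assumption.

```python
import re
from collections import defaultdict

# First four hits of the e-mailed answer, one hit per line.
reply = """\
>ANX1_BOVIN:59 GVDEATIIEILTKRNNAQRQQIKAAY
>ANX1_CAVCU:59 GVDEATIIDILTKRNNAQRQQIKAAY
>ANX1_CAVCU:131 GTDEDTLIEILVSRKNREIKEINRVY
>ANX1_COLLI:126 GTDEDTLIEILASRNNKEIREACRYY
"""

# Collect (position, segment) pairs under each identifier.
hits = defaultdict(list)
for entry, pos, segment in re.findall(r">(\S+):(\d+)\s+([A-Z]+)", reply):
    hits[entry].append((int(pos), segment))

for entry, found in hits.items():
    positions = ", ".join(str(pos) for pos, _ in found)
    print(f"{entry}: {len(found)} hit(s) at {positions}")
```

Entries with more than one hit, like ANX1_CAVCU here, are the ones in which the pattern occurs repeatedly.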
You can now try to build an extended pattern and scan the nonredundant
database with it. Here you can discover new ANNEXIN-REPEATs in sequences
not yet in Swiss-Prot. You can identify these by looking for the non-Swiss-Prot
identifiers in the output!
For example, a pattern variant extended at the C-terminal end of the
previous pattern looks like this:
G-X-[NDE]-E-X(3)-[IVLM]-X-[IVLM]-L-X(4,7)-[EQ]-X(3,10)-[FWY]-X(5)-[EQKR]
Here we used "X" for the group "f". Submitting it
to Swiss-Prot we get identical results (not shown). This means that the
first pattern was already good enough...
So we can now scan the updated nonredundant database for new annexins. We do this on the same server, using the nonredundant option.
We use the sequences option since we would like to collect the new
sequences...