What is protein sequence analysis?
Proteins are made up of twenty kinds of amino acids, each denoted by its own
one-letter code. A protein contains about two hundred amino acids on average
and is represented as a sequence of these code letters. Because every amino
acid has its own properties, such as volume, hydrophobicity and polarity, the
order of the amino acids in the sequence determines the structure and
function of the protein.
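To make the letter notation concrete, the following is a minimal sketch of how a sequence string can be expanded into amino acid names. The table lists the standard one-letter codes; the names `AMINO_ACIDS` and `expand` are illustrative choices and do not come from the original text.

```python
# One-letter codes for the twenty standard amino acids.
AMINO_ACIDS = {
    "A": "Alanine", "R": "Arginine", "N": "Asparagine", "D": "Aspartic acid",
    "C": "Cysteine", "Q": "Glutamine", "E": "Glutamic acid", "G": "Glycine",
    "H": "Histidine", "I": "Isoleucine", "L": "Leucine", "K": "Lysine",
    "M": "Methionine", "F": "Phenylalanine", "P": "Proline", "S": "Serine",
    "T": "Threonine", "W": "Tryptophan", "Y": "Tyrosine", "V": "Valine",
}


def expand(sequence):
    """Return the amino acid names spelled out by a string of code letters."""
    # A KeyError here means the string contains a letter outside the table.
    return [AMINO_ACIDS[letter] for letter in sequence.upper()]
```

For instance, `expand("HEKL")` yields the names Histidine, Glutamic acid, Lysine and Leucine, in that order.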
Techniques for determining protein sequences have been established for so
long that more than twenty thousand sequences have been recorded in this
letter notation, and the number is growing day by day. The structures of
proteins are also being solved: methods such as X-ray crystallography reveal
how a chain of amino acids folds. These methods take many months to complete,
however, so only three hundred protein structures have been determined so far.
An important way of discovering new biological information is to infer the
unknown structure of a protein from its sequence. This is done by analyzing
the sequence of amino acids because, fortunately, proteins that have similar
sequences have similar structures. Multiple sequence alignment is one of the
most typical methods of sequence similarity analysis. An alignment of several
protein sequences can provide valuable information about the function or
structure of the proteins, especially if one of the aligned proteins has been
well characterized.
Let us show an example of multiple sequence alignment. The following set
of sequences represents an alignment of six different protein sequences. HEKL
stands for a row of Histidine, Glutamic acid, Lysine and Leucine.
[Figure: alignment of six protein sequences]
Each sequence is shifted by inserting gaps (dash characters), so that each
column of the resulting alignment contains the same or similar amino acids.
An identical pattern such as H . . . . H and C . . C is considered to be an
important site called a sequence motif, or simply a motif, because important
sites in a protein sequence are conserved through the evolutionary cycles of
mutation and natural selection. Multiple sequence alignment is useful not
only for inferring the structure and function of proteins but also for
drawing a phylogenetic tree that traces the evolutionary histories of
organisms.
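The idea of reading conserved sites off the columns of an alignment can be sketched as follows. This is only an illustration under stated assumptions: the three rows in `toy` are made-up sequences, not the six proteins of the example, and the function names are hypothetical. A column counts as conserved here only if every row carries the same residue and none carries a gap; the "similar amino acids" case mentioned in the text is not handled.

```python
GAP = "-"


def conserved_columns(alignment):
    """Return (index, residue) pairs for columns where all rows agree."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    hits = []
    # zip(*alignment) walks the alignment column by column.
    for index, column in enumerate(zip(*alignment)):
        residues = set(column)
        if len(residues) == 1 and GAP not in residues:
            hits.append((index, column[0]))
    return hits


def motif_pattern(alignment):
    """Show conserved columns as letters and every other column as a dot."""
    conserved = dict(conserved_columns(alignment))
    width = len(alignment[0])
    return "".join(conserved.get(i, ".") for i in range(width))


# Hypothetical toy alignment, invented for this sketch only.
toy = [
    "AC-KCH",
    "GCEKCH",
    "-CDRCH",
]
```

With this toy input, `motif_pattern(toy)` gives `.C..CH`: the dots mark variable or gapped columns, and the surviving letters are the candidate motif positions.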
Dynamic programming on sequence alignment
Dynamic programming (DP) is a basic method for finding an optimal alignment.
It can be regarded as a search for the best path through an N-dimensional
network. When two groups of sequences are given, the method forms a
two-dimensional network of nodes connected by arrows (Figure