Document for FGCS'92 Demonstrations Documents P.58

What is protein sequence analysis? 

Proteins are made up of twenty kinds of amino acids, each denoted by its 
own code letter. A protein contains about two hundred amino acids on 
average and is represented by a sequence of these code letters. Because every 
amino acid has its own properties, such as volume, hydrophobicity and polarity, 
the order of the amino acids in the sequence determines the structure 
and function of the protein. 
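
As a small illustration of the code-letter representation, the sketch below 
spells out a sequence string using the standard one-letter amino acid codes. 
The names ONE_LETTER_CODE and spell_out are chosen here for illustration only 
and do not come from the original system. 

```python
# Standard one-letter codes for the twenty amino acids.
ONE_LETTER_CODE = {
    "A": "Alanine", "R": "Arginine", "N": "Asparagine", "D": "Aspartic acid",
    "C": "Cysteine", "Q": "Glutamine", "E": "Glutamic acid", "G": "Glycine",
    "H": "Histidine", "I": "Isoleucine", "L": "Leucine", "K": "Lysine",
    "M": "Methionine", "F": "Phenylalanine", "P": "Proline", "S": "Serine",
    "T": "Threonine", "W": "Tryptophan", "Y": "Tyrosine", "V": "Valine",
}


def spell_out(sequence):
    """Return the amino acid names for a string of one-letter codes.

    Hypothetical helper: raises KeyError for any letter outside the
    twenty standard codes.
    """
    return [ONE_LETTER_CODE[letter] for letter in sequence.upper()]


if __name__ == "__main__":
    print(spell_out("HEKL"))
```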

Techniques for determining protein sequences are so well established 
that more than twenty thousand sequences have been recorded as strings of 
code letters, and this number is growing day by day. The structures of proteins 
are also being solved. Methods such as X-ray crystallography reveal how a chain 
of amino acids folds. But these methods take many months to complete, 
so only three hundred protein structures have been determined so far. 

An important way of discovering new biological information is to infer 
the unknown structure of a protein from its sequence. We do this by 
analyzing the sequence of amino acids, because, fortunately, proteins that have 
similar sequences have similar structures. Multiple sequence alignment is one 
of the most common methods of sequence similarity analysis. The alignment 
of several protein sequences can provide valuable information for researching 
the function or structure of proteins, especially if one of the aligned proteins 
has been well characterized. 

Let us show an example of multiple sequence alignment. The following set 
of sequences represents an alignment of six different protein sequences. Here, 
a string such as HEKL stands for a run of Histidine, Glutamic acid, Lysine 
and Leucine. 





Each sequence is shifted by inserting gaps (dash characters) so that each 
column of the resulting alignment contains the same or similar amino acids. An 
identical pattern such as H . . . . H or C . . C is regarded as an important site 
called a sequence motif, or simply a motif, because functionally important 
sites in a protein sequence are conserved through the evolutionary cycles of 
mutation and natural selection. Multiple sequence alignment is useful not only 
for inferring the structure and function of proteins but also for drawing a 
phylogenetic tree that traces the evolutionary history of organisms. 
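
The idea of reading a motif off the columns of an alignment can be sketched 
as follows. This is only a minimal illustration, not the method used by the 
system described here: it assumes that "conserved" means every sequence 
carries the identical residue in that column, and the three aligned rows in 
toy are made-up data, not the six sequences of the example above. The function 
names conserved_columns and motif_pattern are likewise hypothetical. 

```python
def conserved_columns(alignment):
    """Return (index, residue) pairs for columns where all rows agree.

    `alignment` is a list of equal-length strings; '-' marks a gap.
    Columns containing a gap are never reported as conserved.
    """
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    hits = []
    for i in range(width):
        column = {row[i] for row in alignment}
        if len(column) == 1 and "-" not in column:
            hits.append((i, alignment[0][i]))
    return hits


def motif_pattern(alignment):
    """Write the conserved columns as a pattern, with '.' for variable ones.

    The pattern spans from the first to the last conserved column.
    """
    conserved = dict(conserved_columns(alignment))
    if not conserved:
        return ""
    first, last = min(conserved), max(conserved)
    return "".join(conserved.get(i, ".") for i in range(first, last + 1))


# Made-up toy alignment (assumption for illustration only).
toy = [
    "-CAKCGKH",
    "ECELCGRH",
    "-CPQCDKH",
]

if __name__ == "__main__":
    print(conserved_columns(toy))
    print(motif_pattern(toy))
```

A practical tool would also have to treat similar (not only identical) amino 
acids as matching, which this sketch deliberately leaves out. 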

Dynamic programming on sequence alignment 

Dynamic programming (DP) is a basic method for finding an optimal 
alignment. The method can be regarded as a best-path search in an N-dimensional 
network. When two groups of sequences are given, a two-dimensional 
network consisting of nodes connected by arrows is formed (Figure 

