Protein Sequence Analysis Program -- Multiple Sequence Alignment by Parallel Iterative Aligner -- ABSTRACT Every protein is a chain of twenty kinds of amino acids. Its structure and function are determined by the sequence of these amino acids. Protein sequence analysis is vital to the prediction of protein structure and function, and to the analysis of the evolutional relationship among proteins. We built the "Parallel Iterative Aligner" for this analysis using a PIM machine. It solves a typical protein sequence analysis problem, that of multiple sequence alignment. This kind of combinatorial problem generally requires a large amount of computation. Our Parallel Iterative Aligner, however, can quickly solve practical-sized problems, producing high-quality answers. KEY FEATURES Iterative Improvement The system partially improves temporary alignment in an iterative way, and can effectively reach high-quality alignment as a result. Speed-up by Parallel Execution Multiple branches of a search tree in this combinatorial problem are eval- uated in parallel by using many processing elements in each iteration. Pruning of Search Tree The heuristic method, "Restricted Partitioning Technique," prunes a large number of branches in the search tree and makes it possible to solve the combinatorial problem in a practical amount of time. Operation of Parallel Iterative Aligner - 57 - What is protein sequence analysis? Proteins are made up of twenty kinds of amino acids, which we distinguish by twenty different code letters. A protein has about two hundred amino acids on average and is represented by a sequence of code letters. Because every amino acid has its own properties of volume, hydrophobicity, polarity and so on, the order of the amino acids in the protein sequence gives the structure and function of the protein. The protein sequence determination technique has been established for so long that more than twenty thousand sequences have been specified by the letters; this number is growing day by day. The structures of proteins are also being solved. Methods such as X-ray crystallography reveal how a chain of amino acids folds together. But these methods take many months to complete, so only three hundred protein structures have been determined so far. An important way of discovering new biological information is by inferring the unknown structure of a protein from its sequence. We do this by ana- lyzing the sequence of amino acids, because, fortunately, proteins that have similar sequences have similar structures. Multiple sequence alignment is one of the most typical methods of sequence similarity analysis. The alignment of several protein sequences can provide valuable information for researching the function or structure of proteins, especially if one of the aligned proteins has been well characterized. Let us show an example of multiple sequence alignment. The following set of sequences represents an alignment of six different protein sequences. HEKL stands for a row of Histidine, Glutamic acid, Lysine and Leucine. Each sequence is shifted by inserting gaps (dash characters). Each column of the resultant alignment has the same or similar amino acids. An identical pattern such as H . . . . H and C . . C is considered to be an important site called a sequence motif, or simply a motif, because an important protein sequence site has been conservative along with evolutional cycles between mutation and natural selection. Multiple sequence alignment is useful not only for inferring the structure and function of proteins but also for drawing a phylogenetic tree along the evolutional histories of the creatures. Dynamic programming on sequence alignment Dynamic programming (DP) is a basic method to find an optimal align- ment. The method is regarded as the best path search in the N-dimensional network. In the method, if two groups of sequences are given, a two-dimensional network that has a number of nodes connected by arrows is formed (Figure - 58 - 1). A score is assigned to each arrow. We search a path from the top left node to the bottom right node, maximizing the total score of the arrows. In this case, the set of arrows that connect black circle nodes is the best path. This best path corresponds to the optimal alignment. Scores on arrows should reflect similarity between compared characters. In the case of protein sequence alignment, Dayhoff's odds matrix is the most popular way of obtaining the scores. The matrix was obtained by statistical analysis of the mutation probability of amino acids. Figure 1 Two-dimensional DP Theoretically, N-dimensional DP provides an optimal alignment of N groups of sequences. However, N-dimensional DP operates in exponential time as N grows. When N is more than three, it does not complete in a realistic time frame. Conventionally, researchers in the biological field make a multiple se- quence alignment by merging groups of aligned sequences. A conventional algorithm, called the tree-based algorithm, merges them in tree-like order us- ing two-dimensional DP. Though the execution time of the conventional algorithm is manageable, the quality of its resultant alignment is not high enough yet. Thus, researchers fairly often have to do multiple sequence alignments by hand. The large number of sets of sequences to be aligned have become a burden on those researchers. Parallel iterative aligner We developed a parallel iterative aligner, whose execution will be demon- strated, in order to improve the quality of automatic multiple sequence align- ment. The algorithm of this parallel iterative aligner is based on the Berger- Munson algorithm. Firstly, we introduce the B-M algorithm, and then we explain the parallel iterative aligner. The B-M algorithm features a novel randomized iterative strategy so as to generate a high-score multiple sequence alignment. Figure 2 illustrates the iterative strategy, whose procedure is as follows: the initially aligned sequences are randomly divided into two groups. By fixing the alignment of - 59 - sequence members within each group, we can optimize the alignment between the groups, using two-dimensional DP. The resultant alignment, in turn, is the starting point for the next alignment of a different pair of groups. Each iteration that improves the alignment between two sequence groups will also improve the global alignment. Figure 3 Parallel B-M algorithm This iterative strategy often results in much better multiple alignments than those obtained by conventional algorithms. In Figure 4, (b) shows the performance when this B-M algorithm is applied to seven sequences. Lines (b-1), (b-2) and (b-3) are different in respect to random numbers. The quality of the results is superior to that obtained by the tree-based algorithm, shown as a horizontal line (a). However, the B-M algorithm needs a large amount of time as the number of sequences grows. We can reduce the execution time, when a parallel machine is available. Figure 3 shows the algorithm of our parallel iterative aligner. Every possible partitioning into two groups of aligned sequences can be respectively evaluated by two-dimensional DP in a parallel way. In each iteration, the evaluation is executed in parallel and the alignment which has the best score is selected as the starting point for the next iteration. The performance of this method is shown as line (c) in Figure 4. This shows that the parallel iterative aligner performs better than the original B-M method in terms of execution time. - 60 - Figure 4 Performance comparison Furthermore, we have developed an effective heuristic search, restricted partitioning technique. Applying the iterative strategy, we realized that the number of sequences in the divided groups is important. As partitioning divides N sequences into k sequences and N-k sequences, a smaller k tends to provide a larger improvement when using two-dimensional DP. The restricted partitioning technique preferentially selects partitionings which have a small k such as one or two. It can restrict the search space and reduce the execution time remarkably. Parallel iterative aligners with this technique can manage more sequences at the same time than those without it. Demonstration The demonstration system solves an alignment of 22 sequences using the parallel iterative aligner. It is executed on PIM/m, which consists of 256 processing elements (PEs). We use the restricted partitioning technique; the number of sequences in the smaller divided group is restricted to one or two. 253 PEs (22C1 = 22 plus 22C2 = 231) are necessary to execute all restricted partitioning in parallel. Additionally, one processor is used for a manager process which selects the best alignment in every iteration. In total, 254 PEs are employed in this demonstration. The demonstration shows gradual improvement of the alignment in each iteration. Code letters for amino acids in the display are colored according to their properties (see next page). The score for each column using a bar graph under the alignment is also shown. High-score columns, indicated by tall bars, correspond to well-conserved sites in evolution. Representative sequence patterns in the sites are regarded as motifs. In the final resultant data of this demonstration, we can recognize a number of motifs. One of them is a famous motif, the site of which binds ATP. - 61 -