Proc. of FGCS '94, ICOT, Tokyo, December 1994

Parallel Basic Software

Takashi Chikayama
Institute for New Generation Computer Technology
4-28, Mita 1-chome, Minato-ku, Tokyo 108 Japan
chikayama@icot.or.jp

Index

0. Abstract
1. Introduction
2. Evaluation of PIMs
3. KLIC: A Portable KL1 Implementation
4. Distributed Pool
5. Kappa: A Parallel Database Management System
6. MGTP: A Bottom-Up Theorem Prover
7. Conclusion
8. References

Abstract:

Results of the research and development on parallel basic software conducted in the FGCS follow-on project are reported. The principal objective of the project is the dissemination of the concurrent logic programming technologies developed in the main FGCS project.

To reach this objective, the software systems developed on parallel inference machines (PIMs) in the FGCS project needed a more widely available platform. PIMs were evaluated in further depth and, based upon the results, a new implementation of the concurrent logic programming language KL1 for Unix-based systems, KLIC, was developed. The principal basic software systems on PIMs were ported to KLIC, including the parallel database management system Kappa and the parallel bottom-up theorem prover MGTP.

1 Introduction

In the Fifth Generation Computer Systems (FGCS) project, conducted in fiscal years 1982 through 1992, a parallel inference system was developed to provide massive processing power together with a comfortable software development environment based on concurrent logic programming. The parallel inference system consists of parallel inference machines [7], parallel implementations of the concurrent logic programming language KL1 upon them, and basic software systems [2][17].

Through the development of numerous experimental parallel application software systems, the parallel inference system proved to provide superior processing power and a comfortable environment for efficient software development. However, being based on special-purpose hardware, the system had drawbacks that prevented wider utilization of it and of the application systems built on it.

For wider availability and easier access, the KL1 language system and the basic software systems have been redesigned and rebuilt on computer systems commonly available in the market. This paper gives an overview of the research and development in this area in the FGCS follow-on project.

1.1 Motivation and Objectives

The parallel inference system built in the FGCS project had the following drawbacks, which had been obstacles to broader utilization of the software developed in the project.

To remove these obstacles, the following had to be provided.

1.2 R&D Overview

The requirements described above led us to design a new implementation of KL1 for Unix-based computer systems, named KLIC.

For higher portability, KLIC uses the C language as an intermediate language for compiling KL1 programs, rather than going directly down to the machine instruction or microprogram level as was done on PIMs. This, however, is not the only difference between KLIC and the PIM implementations; many of the design decisions were remade, for various reasons.

Deeper analyses of the KL1 implementation on PIMs, with its many experimental parallel application systems, allowed us to evaluate various implementation ideas more quantitatively, and the results influenced the design of KLIC considerably. The different target hardware also forced redesign, of course: some implementation ideas that needed architectural support for efficiency were given up, and several new ideas were incorporated instead.

Some of the other basic software systems also needed nontrivial redesign to run efficiently on general-purpose hardware. In the case of PIMs, we designed the whole computer system from the architecture level up, with specially devised communication mechanisms that the KL1 implementations could use directly. KLIC, on the other hand, targets hardware whose communication mechanisms have relatively low throughput and high latency. More importantly, the operating system layer between the hardware and the language implementation incurs considerable overhead. Thus, some basic software systems had to be redesigned to reduce the amount and frequency of interprocessor communication. The distributed pool is a representative example.

KLIC's features for linking with programs in other languages made it possible to write some parts of systems in languages other than KL1, each best suited to its purpose. Most conventional languages are suited only for sequential processing, but KLIC can be used effectively to combine such sequential parts into a parallel system. The parallel database management system Kappa was ported to KLIC with its low-level data handling rewritten in C.

The language KL1 is a logic programming language, but it is not meant to be a theorem prover for full first-order logic. The same is true of Prolog, but KL1 gives up even Prolog's automatic search in exchange for better control of parallel execution, which makes its language level lower than Prolog's. To fill this enlarged gap between the needs of application software and the features provided by KL1, the bottom-up theorem prover MGTP on PIMs was augmented with various features to make it useful as a common tool and programming language for knowledge processing, and was ported onto KLIC.

The rest of the paper gives more detailed descriptions of each R&D topic. The next section gives an overview of the evaluation effort made on the KL1 implementation on PIMs. Section 3 describes the outline of the KLIC implementation of KL1. The distributed pool, a table handling utility optimized for distributed processing, is described in section 4. Section 5 gives the outline of the database management system Kappa. R&D on the MGTP theorem prover is described in section 6. Finally, a conclusion is given.

2 Evaluation of PIMs

Five models of the parallel inference machine PIM were built in the FGCS project. In the FGCS follow-on project, their evaluation through the building of experimental parallel application software continued.

In addition, the KL1 implementation on the PIM model-p has been inspected in greater depth after various improvements. The overall processing speed of the implementation has more than doubled during the two years of the project period; smoothing out such rough edges has made the evaluation of the implementation scheme more reliable.

2.1 Memory Management

All the KL1 implementations on PIMs adopted a reference counting memory management scheme with a single reference count bit in each pointer [4]. Through analyses of the implementation, the following were found.
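For concreteness, the following is a minimal C sketch of such a 1-bit scheme; the word layout, the names, and the free_cell hook are illustrative assumptions, not the actual PIM encoding.

    #include <stdint.h>

    typedef uintptr_t word;           /* tagged pointer word                */
    #define MRB ((word)1)             /* "possibly multiply referenced" bit */

    static word mrb_addr(word p)   { return p & ~MRB; }
    static int  mrb_single(word p) { return (p & MRB) == 0; }

    /* Duplicating a reference loses the single-reference property:
       both resulting copies must carry the MRB bit from then on.   */
    static void mrb_duplicate(word src, word *dst1, word *dst2)
    {
        *dst1 = src | MRB;
        *dst2 = src | MRB;
    }

    extern void free_cell(word addr);    /* hypothetical allocator hook */

    /* Consuming the last use of a reference that is still single lets
       the cell be reclaimed at once, without garbage collection.      */
    static void consume(word ref)
    {
        /* ... read the cell's contents via mrb_addr(ref) ... */
        if (mrb_single(ref))
            free_cell(mrb_addr(ref));
    }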

2.2 Shared Memory Implementation

The model-p of PIM has clusters of eight processors sharing memory and a bus, and the clusters are in turn connected by a hypercube network. Thus, its parallel implementation of KL1 has both shared memory and distributed memory portions.

Through analyses of the shared memory portion of the implementation, we found the following.

2.3 Distributed Memory Implementation

The distributed memory implementations of KL1 on PIMs adopted a two-level addressing scheme with local and global addresses. Interprocessor memory management was done through weighted reference counting.
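The weighted scheme can be sketched as follows; this is a minimal illustration under assumed names and a fixed initial weight, not the actual PIM protocol.

    typedef struct {
        void *local_addr;   /* the datum in the owner's local heap      */
        long  weight;       /* total weight held by outstanding remote
                               references to this entry                 */
    } export_entry;

    #define INITIAL_WEIGHT 1024L

    extern void release_entry(export_entry *e);   /* hypothetical hook */

    /* Owner: export a reference, recording its weight as outstanding. */
    long export_ref(export_entry *e)
    {
        e->weight += INITIAL_WEIGHT;
        return INITIAL_WEIGHT;   /* the weight travels with the reference */
    }

    /* Any holder: duplicate a remote reference by splitting its weight;
       no message to the owner is needed.  (When the weight reaches 1 it
       can no longer be split; real schemes add an indirection entry,
       omitted here.)                                                    */
    long split_weight(long *w)
    {
        long half = *w / 2;
        *w -= half;
        return half;
    }

    /* Owner: a discarded remote reference sends its weight back.  The
       entry can be reclaimed once no weight remains outstanding.       */
    void return_weight(export_entry *e, long w)
    {
        e->weight -= w;
        if (e->weight == 0)
            release_entry(e);
    }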

Analyses of the implementation revealed the following.

2.4 KLIC on PIM

The sequential core of KLIC (described in the next section) was ported to the model-p of PIM. It provided an interesting basis for comparing two different implementation schemes.

The KLIC implementation showed performance similar to the original implementation on average. This was unexpected, as the original implementation utilizes the special features of the PIM hardware while the KLIC implementation relies only on RISC-like instructions generated by a C compiler. The following are suspected reasons.

One point where KLIC was found to perform much worse than the original implementation was goal suspension for data-flow synchronization. KLIC owes its performance to inline code expansion, and its object code therefore tends to become large. To alleviate this problem, the compiled code for a program module has only one runtime subroutine call site, used in common for all possible execution interrupts: goal suspension, garbage collection, external interrupts, and so on. This increased the cost of goal suspension considerably. However, it was also found that only a few application programs suspend frequently.
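A hedged sketch of such a single shared call site follows; the heap-limit trick and all names are assumptions for illustration, not KLIC's actual code.

    extern char *heap_ptr;      /* next free heap location   */
    extern char *heap_limit;    /* apparent end of the heap  */

    /* Any event needing attention (garbage collection, an external
       signal, ...) shrinks the apparent heap limit, so the single
       cheap comparison below fires on the next allocation.  The
       runtime restores the real limit after servicing the event.  */
    void post_event(void) { heap_limit = heap_ptr; }

    extern void runtime_service(void);  /* the one shared call site;
                                           it sorts out the real cause:
                                           suspension, GC, signal, ... */

    void compiled_goal_body(unsigned long bytes_needed)
    {
        if (heap_ptr + bytes_needed > heap_limit)
            runtime_service();
        /* ... inline-expanded goal body continues here ... */
    }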

Considering the performance superiority of KLIC on Unix machines over other logic programming language systems, the porting proved that the original KL1 implementation on PIMs was reasonably efficient, although some room for improvement remains.

Further details can be found in [13].

3 KLIC: A Portable KL1 Implementation

KLIC is a portable implementation of the concurrent logic programming language KL1. As almost all the software systems developed in the latter part of the FGCS project were written in KL1, a portable implementation was essential for dissemination of the FGCS technologies.

3.1 Design Policy

KLIC was designed under the following policies.

3.2 System Outline

The KLIC system consists of the following components.

KL1 compiler: translates KL1 programs into plain C programs, using only those features commonly available in most C compilers.

Runtime libraries: a collection of C subroutines called from the translated C code.

C header files: provide the data type definitions and inline-expanded operations used in the translated C code and the runtime libraries.

Compiler driver: controls the compilation and linkage of KL1 programs.

Compilation of KL1 programs proceeds just like compilation of C programs, except for an additional precompilation step from KL1 to C (Fig. 1). In a sense, KL1 is a so-called "fourth generation language".

Figure 1: Compilation of KL1 Programs
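As a rough illustration (not KLIC's actual object code), the translated program can be pictured as one C function per predicate reduction, driven by a scheduling loop over goal records; all names below are assumed.

    #include <stddef.h>

    typedef struct goal goal;
    typedef goal *(*goal_code)(goal *);  /* one reduction; returns a goal
                                            to run next, or NULL          */
    struct goal {
        goal_code code;     /* compiled clause selection for this goal */
        void     *args[4];  /* argument registers (size illustrative)  */
        goal     *next;     /* intrusive goal queue link               */
    };

    /* The driver loop repeatedly reduces the goal at the queue head;
       a goal handed back by a reduction is pushed on and run next.   */
    void drive(goal *queue)
    {
        while (queue != NULL) {
            goal *g = queue;
            queue = g->next;
            goal *next = g->code(g);   /* run one reduction */
            if (next != NULL) {
                next->next = queue;    /* e.g., the last body goal */
                queue = next;
            }
        }
    }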


Different runtime libraries are provided for different forms of usage.

The simplest library provides the basic features and is meant for production code. The debugging library is for program development, providing tracing and better error diagnostics in addition to the basic features.

Another axis that characterizes the libraries is their parallel processing features. The sequential runtime library does no parallel processing. The shared memory and distributed memory parallel runtime libraries provide different parallel processing schemes. Furthermore, the core algorithm of the distributed memory implementation is made independent of the physical communication mechanism, and separate libraries are provided for the different lower-level communication mechanisms used. Of course, users are expected to install only those libraries they need (Fig. 2).


Figure 2: Structure of Parallel Implementations


Note that the C code translated from KL1 programs remains the same for all these different purposes. Switching from production code to debugging code requires only linking with a different library. This also makes optimizations of the compiler and of the parallel implementation mechanisms orthogonal to each other, making concurrent design and development of the system easier.

To use the same code for different parallel implementation schemes, the core implementation needs handles for recognizing the needs of non-local processing. Explicit operations, such as designating a goal to be executed on a different worker, are easily recognized. Shared data accesses, however, are implicit, so data access operations have to provide handles for escaping to the appropriate runtime library routines. This is realized by the generic objects described in section 3.4.

3.3 Basic Design

Taking into account the evaluation results of the KL1 implementations on PIMs, the differences in target machine architectures, and recent research results on logic programming language implementations, we made the following design decisions for the core part of the KLIC system.

3.4 Generic Objects

With the KL1 implementations on PIMs [7], we experienced severe difficulties in trying out different parallel execution schemes, as the schemes were too deeply integrated into the system core. The lesson was that system extensibility and modifiability should be put above bare efficiency. On the other hand, as the system targets stock hardware, only a limited number of pointer tag bits can be handled efficiently, so some other way of distinguishing the various built-in data types was needed.

Generic objects were introduced to achieve these two objectives at once. We borrowed the basic idea of generic objects from AGENTS [11], and modified and extended it for KLIC.

The core of the runtime system and the compiled code know only one generic data type, called the generic object. Generic objects of all classes have the same interface, so new object classes can be added freely without changing the system core. For example, copying objects during garbage collection is implemented as one of their methods, and the system core thus does not have to know their physical representation.
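A minimal sketch of this uniform interface follows, with an assumed table layout and method names; the actual KLIC definitions differ in detail.

    typedef struct generic_object generic_object;

    typedef struct {
        /* every class: copy yourself during garbage collection */
        generic_object *(*gc_copy)(generic_object *self);
        /* consumer objects: the associated variable got a value */
        void (*unify)(generic_object *self, void *value);
        /* generator objects: the variable's value is demanded */
        void (*generate)(generic_object *self);
    } method_table;

    struct generic_object {
        const method_table *methods; /* identical interface for all classes */
        /* class-specific representation follows */
    };

    /* The GC core copies any object without knowing its layout: */
    static generic_object *copy_during_gc(generic_object *o)
    {
        return o->methods->gc_copy(o);
    }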

KLIC has three categories of generic objects.

Data Objects are immutable objects without time-dependent states. Time-independence means that they always look the same from KL1 programs; their physical representation may still be modified. For example, multiversion arrays are data objects that look immutable, but different versions of an array are represented by one physical array plus linked records of differences, which are mutated on updates.
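The multiversion array idea can be illustrated with the following sketch; the field names and the single-int element type are assumptions, not KLIC's actual representation.

    #include <stdlib.h>

    typedef struct varray {
        struct varray *newer;   /* NULL: this version owns the body     */
        long diff_index;        /* where this version differs from the
                                   next newer one                       */
        int  diff_value;        /* the value this version holds there   */
        int *body;              /* physical array, valid when newer
                                   is NULL                              */
    } varray;

    /* Update the newest version in O(1): mutate the physical array in
       place and turn the old version into a diff record, so both
       versions still look immutable from KL1.                         */
    varray *va_update(varray *v, long i, int x)   /* v must be newest */
    {
        varray *nv = malloc(sizeof *nv);   /* allocation check omitted */
        nv->newer = NULL;
        nv->body  = v->body;
        v->diff_index = i;
        v->diff_value = v->body[i];   /* remember the overwritten value */
        v->newer = nv;
        v->body  = NULL;
        nv->body[i] = x;
        return nv;
    }

    /* Reading an old version walks the diff chain and falls through
       to the physical array at the end.                               */
    int va_read(const varray *v, long i)
    {
        while (v->newer != NULL) {
            if (v->diff_index == i)
                return v->diff_value;
            v = v->newer;
        }
        return v->body[i];
    }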

With a set of macros provided by the system, users can define object classes that interface C programs with the KLIC system in an object-oriented manner. On distributed memory implementations, object migration can be realized simply by providing a method that encodes the object into a message, along with a corresponding decoder. Higher-level decisions, such as when to migrate objects, and the lower-level communication procedures are taken care of by the KLIC system.

Consumer Objects are data-driven objects with time-dependent states. Each is associated with an uninstantiated logical variable and is activated by its instantiation. When the unifier recognizes that the instantiated variable is associated with a consumer object, the unify method of the object is called. The object performs the task specified by the given value and, when the value is structured data, may associate itself again with a variable inside the structure. Consumer objects behave exactly like a process described in KL1 waiting for instantiation of a variable, except that arbitrary C code can describe their behavior.

Generator Objects are demand-driven objects. They too are associated with an uninstantiated variable, but are activated when the variable's value is demanded. When a variable associated with a generator object is found to be required during goal suspension handling, the generate method of the object is called. It may generate a value immediately to instantiate the associated variable, or it may initiate some activity that will eventually do so.

In parallel implementations [15][14], non-local references are represented as generator objects: their generate methods send a value-fetching message to another worker or read the data from shared memory.
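A hedged sketch of a remote reference as a generator object; send_fetch_message and the record fields are assumed for illustration, not actual KLIC interfaces.

    typedef struct generic_object {
        const struct method_table *methods;   /* see the sketch above */
    } generic_object;

    /* Assumed message primitive; not an actual KLIC function. */
    extern void send_fetch_message(int owner_worker, long export_index);

    typedef struct {
        generic_object header;  /* generic object interface            */
        int  owner_worker;      /* worker that holds the real datum    */
        long export_index;      /* entry in that worker's export table */
    } remote_ref;

    /* generate method: demand arose during suspension handling, so ask
       the owner for the value; the reply will instantiate the variable
       and resume the suspended goals.                                  */
    static void remote_generate(generic_object *self)
    {
        remote_ref *r = (remote_ref *)self;
        send_fetch_message(r->owner_worker, r->export_index);
    }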

AGENTS makes no distinction among these three categories: it has no generators, and the functionality of consumers is provided by ports, which are data objects. Ports are names of streams rather than the streams themselves.

3.5 Sequential Performance

With careful design of the generated C code, and thanks to the excellent optimization performed by modern C compilers, the sequential core of KLIC shows reasonable performance. For a set of standard benchmark programs that use features common to Prolog and KL1, KLIC runs about twice as fast as the native machine code generated by a widely used Prolog system, SICStus Prolog [1]. This was confirmed with both versions of the KL1-to-C compiler, one written in Prolog and the other in KL1 itself.

While SICStus generates machine code directly, KLIC generates code indirectly through C. Deeper analysis of the generated code revealed that the lack of backtracking benefited KLIC considerably, and that nothing seems to be lost by the indirect code generation scheme through C, while it yields excellent portability [3].

Note that, with the parallelization scheme of KLIC, this excellent sequential performance is directly inherited as the single-processor performance of its parallel implementations.

3.6 Shared Memory Implementation

In the shared memory implementation of KLIC, each worker process has its own local heap area and, in addition, shares one common heap area. The local heaps may be physically homogeneous with the shared heap; the logical distinction reduces the need for locking and increases access locality. The principal management policy is to allow pointers from a local heap to the shared heap, but not the reverse.

Data structures are created in a local heap and stay there as long as no references from other workers are made. Data structures created for temporary use by one worker live out their whole lives within its local heap.

When a reference is made from a different worker, either by passing the data structure as an argument of a distributed goal or by unifying it with a variable in the shared heap, the structure is copied to the shared heap. Nested structures are copied recursively down to the bottom in such cases, as they will be indirectly shared.
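The exporting step can be sketched as follows, assuming a simplified cell representation; a real implementation must also preserve sharing and cycles with forwarding pointers, which this sketch omits.

    #include <stddef.h>

    typedef struct cell {
        int          is_pair;    /* 1: pair of references, 0: atomic */
        struct cell *car, *cdr;  /* children, valid when is_pair     */
        long         atom;       /* atomic contents otherwise        */
    } cell;

    extern cell *shared_alloc(void);        /* assumed shared-heap hook */
    extern int   in_shared_heap(const cell *p);

    /* Copy a structure into the shared heap the first time another
       worker can reach it; nested structures are followed down to
       the bottom, as they become indirectly shared.                 */
    cell *export_to_shared(cell *p)
    {
        if (p == NULL || in_shared_heap(p))
            return p;              /* already visible to every worker */
        cell *copy = shared_alloc();
        *copy = *p;
        if (p->is_pair) {
            copy->car = export_to_shared(p->car);
            copy->cdr = export_to_shared(p->cdr);
        }
        return copy;
    }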

Under the pure semantics of the language, data structures are read-only and thus need no special care even when shared. Uninstantiated variables in the shared heap do call for special treatment, as instantiating them requires locking and copying the local data unified with them to the shared heap. Generic objects are used as handles for accessing shared variables (Fig. 3).


Figure 3: Shared Memory Implementation

As garbage collection is an operation with poor memory access locality, garbage collection of the shared heap can easily saturate the memory bus. By separating local and shared heaps, individual workers can garbage-collect their own local heaps independently and asynchronously. Interesting research is also under way to allow asynchronous garbage collection of the shared heap [14].

3.7 Distributed Memory Implementation

The distributed memory implementation of KLIC inherits its basic design from the distributed memory implementation on PIMs [10]. Two-level addresses, local and global, are used with address translation tables called export tables (Fig. 4).


Figure 4: Distributed Memory Implementation

An important difference is that, unlike on PIMs, where the operating system was built on top of the KL1 language implementation, KLIC programs are no more than application programs running under an operating system. Thus, in KLIC, unification failure can be treated as failure of the whole program. This considerably simplified the distributed unification mechanism [15].

The basic common part of the distributed memory implementation translates the needs of KL1 programs into message exchange. Each concrete implementation then realizes the message exchange using some lower-level mechanism. Portable implementations based on the message passing libraries PVM and MPI, and others based on more system-dependent features such as active messages, are being developed.
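The layering can be pictured as a small transport interface; the names and the fixed-size message record below are assumptions for illustration.

    #include <stddef.h>

    /* An abstract message as produced by the common part. */
    typedef struct {
        int    dest_worker;
        size_t length;
        char   body[256];          /* size illustrative */
    } kl1_message;

    /* Operations each concrete transport (PVM, MPI, shared memory,
       active messages, ...) must supply.                            */
    typedef struct {
        void (*send)(const kl1_message *m);
        int  (*poll)(kl1_message *m);   /* nonblocking; 0 when empty */
    } transport_ops;

    /* The common distributed-implementation code is written once
       against this interface and linked with whichever transport
       library is installed.                                         */
    extern const transport_ops *transport;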

3.8 Portability

The sequential core of the KLIC system has been ported to many Unix-based systems, including systems from DEC, NEC, Hitachi, Omron, SGI, Sony, and Sun. Porting to another Unix-based system usually requires adding at most a line or two to the configuration script. The system has also been ported to personal computers running Linux, MS-DOS, and OS/2.

The shared memory parallel implementation runs on shared-memory multiprocessors from DEC and Sun at the time this manuscript is being prepared (November 1994). It can easily be ported to other systems with shared memory by rewriting only the code for shared memory allocation, locking, and cache coherence (on systems with weak cache coherence semantics).

The distributed memory parallel implementation currently runs on PVM. It runs on workstations connected via Ethernet and is also reported to run on the Intel Paragon. A version of the distributed memory implementation that exchanges its messages through shared memory is also running. An MPI implementation and implementations using system-specific interprocessor communication features are also under development for the AP1000, Cenju3, CM5, SR2001, and some other systems.

Further details can be found in [6].

4 Distributed Pool

The pool is a table maintenance utility provided by the PIMOS operating system [5] of PIMs. It provides various forms of tables with arbitrary keys and data, and has been used extensively by experimental parallel application programs on PIMs.

The original pool itself did not run in parallel; all the data was managed by a single process. This had the following problems.
The access latency problem becomes more apparent on systems where the ratio of computation speed to communication speed is larger, such as some of the targets of the KLIC system.

The distributed pool was designed to solve the above problems by adopting the following schemes.

To maintain the semantics of the original pool utility, a cache coherence algorithm was designed. The algorithm is similar to those used to keep physical memory caches consistent, except that communication between caches is asynchronous and thus requires special care when messages cross or overtake one another.
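The following sketch illustrates only the message-overtaking problem, using per-datum version numbers; the actual distributed pool protocol is more involved, and all names here are assumed.

    #include <string.h>

    typedef struct {
        long version;     /* increases monotonically at the owner */
        char value[64];
    } cache_entry;

    /* Apply an asynchronously delivered update.  An update that was
       overtaken in flight by a newer one arrives with a smaller
       version number and is simply dropped, keeping the cache
       consistent without synchronous rounds.                        */
    void apply_update(cache_entry *c, long msg_version, const char *v)
    {
        if (msg_version <= c->version)
            return;                    /* stale: overtaken in flight */
        c->version = msg_version;
        strncpy(c->value, v, sizeof c->value - 1);
        c->value[sizeof c->value - 1] = '\0';
    }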

Further details can be found in [16].

5 Kappa: A Parallel Database Management System

Kappa is a parallel database management system (DBMS) that provides efficient database management facilities for knowledge information processing applications. Its data management is based on a nested relational model, in which complex structured data can be handled much more efficiently than in the conventional relational model.

Kappa was originally developed as a sequential DBMS and evolved to a parallel version in the FGCS project. In the FGCS follow-on project, the parallel version of Kappa was ported onto KLIC.

Kappa consists of a number of element DBMSs running in parallel, each managing its own data. Each element DBMS has the full functionality of a DBMS. A global map of the relations in the element DBMSs is managed by server DBMSs. The placement of data can be configured depending on its amount and the access patterns to it. One relation can be divided horizontally into subrelations and distributed; to users, it still looks like a single logical relation. For frequently accessed data, copies of a relation can be distributed for increased access locality (Fig. 5).


Figure 5: An Example Configuration of Kappa
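As an illustration of horizontal division, a sketch of hash-based tuple routing follows; in Kappa the placement is a configuration matter, so this particular hash and routing scheme are assumptions.

    #define N_ELEMENT_DBMS 8      /* configuration-dependent */

    /* A simple string hash (djb2); the real placement function need
       not be a hash at all.                                          */
    static unsigned hash_key(const char *key)
    {
        unsigned h = 5381u;
        while (*key != '\0')
            h = h * 33u + (unsigned char)*key++;
        return h;
    }

    /* Route a tuple of a horizontally divided relation to the element
       DBMS storing its subrelation; queries consult the same map, so
       the relation still behaves as one logical relation.             */
    int element_dbms_for(const char *key)
    {
        return (int)(hash_key(key) % N_ELEMENT_DBMS);
    }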

In the KLIC version of Kappa, some lower-level data manipulation portions were recoded directly in C. This brought its sequential performance close to that of conventional relational DBMSs on Unix systems, while retaining its parallel and distributed processing abilities and the added flexibility of its nested relational model.

Further details can be found in [12].

6 MGTP: A Bottom-Up Theorem Prover

The theorem prover MGTP was built on PIMs in the main FGCS project. MGTP proves theorems by generating models from a given axiom set and the theorem to be proved, in a bottom-up manner.

In a sense, the system was meant to supplement KL1 with more general logic programming capabilities. The MGTP system showed remarkable performance through highly parallel processing on PIMs for certain classes of problems; one of its representative results is an automatic proof of an open problem in quasi-group theory. However, it was relatively weak on certain other classes of problems when compared to top-down provers or constraint-based systems. Also, its surface syntax was not easily accepted as a programming language by application programmers.

In the FGCS follow-on project, MGTP has been refined and extended with various features to make it useful as a common tool and a programming language for knowledge processing, and it has also been ported onto KLIC.

To overcome this weakness on certain classes of problems, a technique called the non-Horn magic set (NHM) has been developed. MGTP proves theorems by generating models from a given axiom set, but it also generated literals irrelevant to the proof. The NHM technique controls the model generation to suppress case splits that are redundant or unnecessary for the proof, improving performance remarkably.

To solve constraint satisfaction problems more efficiently, a version with constraint propagation features, called Constraint MGTP (CMGTP), has been developed. For higher-level descriptions, a method of translating formulas of modal logic into first-order logic has been developed, so that proofs in modal logic can be handled by MGTP. For efficiency on systems with slower interprocessor communication, a version with higher data access locality, called distributed MGTP, has also been developed.

In the course of developing these various systems, tools for visualizing the progress of proofs were also built. Such tools have been essential in tuning proof strategies.

Further details can be found in [8].

7 Conclusion

The KL1 language system and the principal basic software systems have been successfully ported to hardware available in the market. Since their release as free software, many researchers outside the project have started to base their work on these systems, and still wider use can be expected in the coming years. From this viewpoint, we may conclude that the research and development of basic software systems in the FGCS follow-on project was quite successful.

On the other hand, much room for improvement remains. Further optimizations based on static analyses are needed to bring the performance of KLIC programs closer to that of programs written directly in C. The frequency of interprocessor communication should also be reduced through static analyses. Kappa and MGTP likewise have room for performance improvements and for better user and program interfaces. We expect a wider, though probably more loosely coupled, research community to form after the termination of the FGCS follow-on project to look into these research topics.

Acknowledgments

The reports on individual R&D topics are based on papers by, and discussions with, those who carried out the work: Koichi Kumon, Hiroyoshi Hatazawa, Tetsuro Fujise, Masaki Sato, Moto Kawamura, and Ryuzo Hasegawa.

References

1
Mats Carlsson, Johan Widen, Johan Andersson, Stefan Andersson, Kent Boortz, Hans Nilsson, and Thomas Sjoland. SICStus Prolog User's Manual, 1993.

2
Takashi Chikayama. Operating system PIMOS and kernel language KL1. In Proceedings of FGCS'92, pages 73-88, Tokyo, Japan, 1992.

3
Takashi Chikayama, Tetsuro Fujise, and Daigo Sekita. A portable and efficient implementation of KL1. In Manuel Hermenegildo and Jaan Penjam, editors, Proceedings of PLILP'94, pages 25-39, Berlin, 1994. Springer-Verlag.

4
Takashi Chikayama and Yasunori Kimura. Multiple reference management in flat GHC. In Proceedings of 4th International Conference on Logic Programming, 1987.

5
Takashi Chikayama, Hiroyuki Sato, and Toshihiko Miyazaki. Overview of the parallel inference machine operating system (PIMOS). In Proceedings of FGCS'88, pages 230-251, Tokyo, Japan, 1988.

6
Tetsuro Fujise, Takashi Chikayama, Kazuaki Rokusawa, and Akihiko Nakase. KLIC: A portable implementation of KL1. In Proceedings of FGCS'94, 1994.

7
Atsuhiro Goto, Masatoshi Sato, Katsuto Nakajima, Kazuo Taki, and Akira Matsumoto. Overview of the parallel inference machine architecture (PIM). In Proceedings of FGCS'88, Tokyo, Japan, 1988.

8
Ryuzo Hasegawa. Parallel theorem provers: MGTP. In Proceedings of FGCS'94, 1994.

9
Hiroyoshi Hatazawa. Evaluation of fine-grain automatic load balancing method on the shared memory parallel machine. Journal of Information Processing Society of Japan, 35(10):2069-2077, October 1994. In Japanese.

10
Nobuyuki Ichiyoshi, Kazuaki Rokusawa, Katsuto Nakajima, and Yu Inamura. A new external reference management and distributed unification for KL1. In Proceedings of FGCS'88, Tokyo, Japan, 1988. Also in New Generation Computing 7-2, 3 (1990), pp. 159-177.

11
Sverker Janson, Johan Montelius, Kent Boortz, Per Brand, Bjorn Carlson, Ralph Clarke Haygood, Bjorn Danielsson, and Seif Haridi. AGENTS user manual. SICS technical report, Swedish Institute of Computer Science, 1994.

12
Moto Kawamura and Toru Kawamura. Parallel database management system: Kappa. In Proceedings of FGCS'94, 1994.

13
Koichi Kumon. Evaluation of parallel inference machine PIM. In Proceedings of FGCS'94, 1994.

14
Masao Morita, Nobuyuki Ichiyoshi, and Takashi Chikayama. A shared-memory parallel execution scheme of KLIC. ICOT technical report, ICOT, 1994.

15
Kazuaki Rokusawa, Akihiko Nakase, and Takashi Chikayama. Distributed memory implementation of KLIC. ICOT technical report, ICOT, 1994.

16
Masaki Sato, Masahiko Yamauchi, and Takashi Chikayama. Distributed pool and its implementation. In Proceedings of FGCS'94, 1994.

17
Kazunori Ueda and Takashi Chikayama. Design of the kernel language for the parallel inference machine. The Computer Journal, 33(6):494-500, December 1990.

footnotes

  1. Actually, we constrained ourselves to a still smaller subset to enable versions on DOS, OS/2, and parallel systems without full Unix functionality on their processing nodes.

  2. The Prolog version is a by-product of the earlier development phase of KLIC.
