Parallel Logic Simulator ABSTRACT Logic simulation, one of the most time-consuming stages in LSI design, is used to verify the logical and timing specifications of designed circuits. We built a high-performance parallel logic simulator capable of deriving a lot of parallelism from the target circuits so as to exploit the entire potential of large-scale MIMD machines, such as the PIM machine. KEY FEATURES Time Warp mechanism (TW) TW asynchronously controls the order in which messages should be eval- uated. It tries to exploit complete parallelism with speculative computa- tion, while the rollback process cancels speculation errors. Reduction of rollback overhead The antimessage reduction mechanism, adequate message scheduling, and the Adaptively Moving Time-Ceiling are used to reduce the cost and fre- quency of rollback. Cascading-Oriented Partitioning (COP) COP provides high-quality solutions for circuit partitioning, achieving low inter-PE communication, high parallelism extraction and load balancing. Process Flow Parallel Logic Simulation - 51 - Background Logic simulators are used in order to verify the logical and timing speci- fication of designed circuits. Since logic simulation is one of the most time- consuming stages in LSI design, faster simulators are required. In addition, flexibility is also needed. A parallel logic simulator is one likely way of producing quick and flexible simulation. We built a parallel logic simulator on the PIM machine, as a first but significant step to the realization of super high performance simulators on future hyper-parallel machines. Specification of the Simulator The simulator simulates combinatorial circuits and sequential circuits that have feedback loops. It handles three values: Hi, Lo, and X (unknown). A different delay time can be assigned to each gate (non-unit delay model). Since this simulator treats gates only, flip-flops and other functional blocks should be completely decomposed into gates. Parallel Discrete Event Simulation and Time Warp Parallel logic simulation is treated as a typical application of parallel dis- crete event simulation (PDES). PDES can be modeled so that several ob- jects (corresponding to gates) change their states by communicating with each other. A message has information of an event whose occurrence time is stamped on the message (time-stamp). Since messages should be received and evaluated in the time-stamp order by their destination objects, a time-keeping mechanism is needed. We adopted the Time Warp mechanism (TW) as the time-keeping mech- anism. In TW, each object usually acts according to received messages and also records the history of messages and states, optimistically assuming that messages arrive chronologically. When a message arrives at an object out of time-stamp order, however, the object rewinds its history (this process is called rollback), and makes adjustments as if the message had arrived in cor- rect time-stamp order. If there are messages which should not have been sent, the object also sends antimessages in order to cancel those messages. The rollback process has been suspected to contain large overheads which might affect performance adversely. So, we added several devices to reduce the overheads so that the simulator can run efficiently on the PIM machine. - 52 - Figure 1: Time Warp mechanism Message Scheduling During simulation, there are usually several messages to be evaluated in a PE, and therefore some scheduling strategy is needed. In our strategy, the message with the smallest time-stamp is evaluated first. This strategy is considered to be adequate for TW because the bigger time-stamp a message has, the more likely the message is to be rolled back. Antimessage Reduction As long as messages are sent through the KL1 stream, the messages arrive at the receiver in the same order as they were transmitted. In this environ- ment, subsequent antimessages can be reduced. We adopted this optimization technique, expecting that it would reduce the rollback cost. Figure 2: Antimessage Reduction - 53 - Adaptively Moving Time-Ceiling In a naively implemented TW, a message is evaluated as soon as it arrives at its destination gate, even if its time-stamp is extremely large. Such a message, however, will probably be rolled back later and thus the evaluation may cause other side-effects. Moving Time-Ceiling (MTC) gives the upper limit of time for suppressing both the evaluation and generation of messages with extremely large time- stamps. MTC contributes to reducing the frequency of rollback. MTC is sometimes updated into the future. However, it is quite difficult to statically determine the optimal interval between the old MTC position and the new. If the interval is too short, many processors might idle although the rollback frequency would be greatly reduced. Conversely, if MTC jumps to the distant future, the rollback frequency will hardly be reduced at all. In our simulator, the MTC interval is adaptively determined according to the rollback frequency and the effect of the previous intervals on performance (Adaptively Moving Time-Ceiling). AMTC can exploit the entire potential of parallel machines without depending on target circuits or simulation con- ditions. Figure 3: Adaptively Moving Time-Ceiling - 54 - Circuit Partitioning In our simulator, the "Cascading-Oriented Partitioning" strategy is used for circuit partitioning to attain high-quality load distribution. Firstly, COP makes several clusters by grouping gates that are connected to each other into a cascade-form. Then, the clusters are cut or merged to adjust their sizes. Finally, the clusters are assigned to processors at random. This scheme provides adequate partitioning solutions that satisfy these three requirements: load balancing, keeping inter-PE communication fre- quency low and extracting a lot of parallelism. Performance Measurement We simulated five sequential circuits on the PIM/m machine. These were benchmark data for ISCAS'89. Table 1 shows the size (# of gates) of the target circuits. Figure 5 indicates the performance of the simulator. Table 1: Size of the circuits In the best case, 537K events/sec performance and 166-fold speedup were attained using 256 processors. - 55 - Figure 5: Performance vs. # of Processors Outline of Demonstration In the demonstration, two sequential circuits, s15850 and s38584, will be simulated. The demonstration will focus on these three points. . The utilization of processors. . The adequacy of the intervals in AMTC. . Total performance and rollback frequency. - 56 -