Fault-Tolerant Parallel Software Library


by

Eiji Sugino and Haruo Yokota at JAIST (Japan Advanced Inst. of Sci. and Tech., Hokuriku)



[Function]

It executes programs like samples as fault-tolerant parallel software.

distribution(Primary,Backup) : in watchdog.kl1

divides all nodes into two groups, and fork watchdog processes on each nodes.

copy(Args,Args1,Args2,Interrupt) : in copy.kl1

replicates arguments 'Args' into 'Args1' and 'Args2', and prepares forwarding processes for each variables.

[Usages]

You should compile your program with

1) watchdog.kl1 and copy.kl1, and

2) ncube.kl1 or sparc.kl1 depended on your system

for example,
% klic -v -dp -o dofts watchdog.kl1 copy.kl1 sparc.kl1 main.kl1 ft_queen.kl1
% ./dofts -n -p5 7d
Attention) You should put nodes, number of which is more than 2.

[Files]

fts/
README.j
README-j.html
README.e
README-e.html
watchdog.kl1
copy.kl1
ncube.kl1
% for nCUBE2
sparc.kl1
% for SPARC
=== samples ===
------- sample 1 -------
main.kl1
queen.kl1
ft_queen.kl1
ft_queen_faulty.kl1
------- sample 2 -------
main-sim.kl1
f_sim.kl1
f_sim_faulty.kl1

[Example]

This program is under construction and we cannot complete as tracing facility on distributed KLIC is not available. Now you can see an only example execution.

1) compile it

% make sparc

2) do without any error

% ./dofts-s -p5 10 100d
Leader [1]
Leader [2]
Backup Site Group : 2 4
Primary Site Group : 1 3
30 [67,1]
<== outputs A (30) and B ([67,1])
Response time is 1359 msec

3) do with an error

amethyst[300]% ./dofts-sf -p5 10 100d
Leader [1]
Leader [2]
Primary Site Group : 1 3
Backup Site Group : 2 4
BOMB! 3
<== 'exit' is called on Node 3
30 FAULT was detected on PRIMARY SITE!
<== fault is detected after output 'A'
FAULT was informed to BACKUP!
BACKUP change to PRIMARY !!! REBIRTH (1) [4]
REBIRTH (2) [4]
... REBIRTH ndet_replay [117,1]
<== outputs B after some messages
^Ckill tasks from io_server
<== finish with ^C

Because Fault-tolerant Conversion program is under construction, you should write one by hand. And we only guide your programming.

[Guide for FATPAS]

1) Split program into two parts; host program and fault-tolerant one.

ex. Original user program is constructed of main.kl1 and queen.kl1 . And 'queen:queen(N,Result)' is the target for FATPAS.

2) Convert the target program as followings.

2-1) The top clause is converted as followings.

% Original HEAD
queen(N,R) :-
% fork watchdogs
watchdog:distribution(Primary,Backup),
%
queen_1({N,R},Primary,Backup) @ lower_priority(10).
%
queen_1(Args,[PTop|Primary],[BTop|Backup]) :-
% replicate arguments
copy:copy(Args,Args1,Args2,Interrupt),
% merge the interrupt streams
Interrupt = {Interrupt1,Interrupt2}, Log = ack(Log1),
% fork a top goal for each sites
PTop = {primary,queen,queen,Args1,Log,Signal,Interrupt1},
BTop = {backup ,queen,queen,Args2,Log1,Signal,Interrupt2}.

2-2) Make clauses for the top goal in module 'exgoal'.

Define following clauses between '=-=-=-=-' lines.
:- module exgoal.
call_goal(Site,Module,Predicate,Args,Log,GSig,Raise)-SC :-
call_goal_0(Site,Module,Predicate,Args,Log,GSig,Raise)-SC @ lower_priority.
% =-=-=-=-
call_goal_0(primary,queen,queen,{A,B},Log,GSig,Raise)-SC :-
queen:queen_record(A,B,Log)+GSig+Raise-SC.
call_goal_0(backup ,queen,queen,{A,B},Log,GSig,Raise)-SC :-
queen:queen_replay(A,B,Log)+GSig+Raise-SC.
% =-=-=-=-
otherwise.
call_goal_0(Type,Module,Method,Arguments,Log,GSignal,Raise)-SC :-
klicio:klicio([stdout(normal(Out))]),
variable:wrap((Type::Module:Method/Arguments), G),
Out = [fwrite("Illegal goal invocation : "), putwt(G), nl,fflush(_)],
Raise = [].

2-3) Make clauses for the top goal in original module.

(1) It's like Instant Replay conversion.

% Record Version
queen_record(N,X,Log) :-
current_node(_,All),
queen_0_record(N,X,~(All-1),Log)@node(1).
queen_0_record(4,X,A,Log) :- queen_record([1,2,3,4],[],[],X,A,Log).
....
queen_record([P|U],C,L,I,PE,Log) :-
Log = c1(Log1,Log2,Log3),
TO:= (P mod PE)+1,
throw_record(U,[P|C],L,I2,PE,TO,Log1),
merge_record(I1,I2,I,Log2),
append(U,C,N),
c1_record(P,1,N,L,L,I1,PE,Log3).
% Replay Version
queen_replay(N,X,Log) :-
current_node(_,All),
queen_0_replay(N,X,~(All-1),Log)@node(1).
queen_0_replay(4,X,A,Log) :- queen_replay([1,2,3,4],[],[],X,A,Log).
....
queen_replay([P|U],C,L,I,PE,Log) :-
Log = c1(Log1,Log2,Log3) |
TO:= (P mod PE)+1,
throw_replay(U,[P|C],L,I2,PE,TO,Log1),
merge_replay(I1,I2,I,Log2),
append(U,C,N),
c1_replay(P,1,N,L,L,I1,PE,Log3).

(2) Add following arguments for all user-defined predicates.

GSig : is for interruption from top to leaf. You should carry it to sub-goals.
Raise: is for signal from leaf to top. When the clause has several sub-goals, you should merge them in sub-goals to one.
SC: is a variable pair for short-circuit detection. You only put it on all goals.
% Record Version
queen_record(N,X,Log)+GSig+Raise-SC :-
current_node(_,All),
queen_0_record(N,X,~(All-1),Log)+GSig+Raise-SC @node(1).
queen_0_record(4,X,A,Log)+GSig+Raise-SC :-
queen_record([1,2,3,4],[],[],X,A,Log)+GSig+Raise-SC.
....
queen_record([P|U],C,L,I,PE,Log)+GSig+Raise-SC :-
Raise = {Raise1,Raise2,Raise3,Raise4},
Log = c1(Log1,Log2,Log3),
TO:= (P mod PE)+1,
throw_record(U,[P|C],L,I2,PE,TO,Log1)+GSig+Raise1-SC,
merge_record(I1,I2,I,Log2)+GSig+Raise2-SC,
append_record(U,C,N)+GSig+Raise3-SC,
c1_record(P,1,N,L,L,I1,PE,Log3)+GSig+Raise4-SC.

(3) In Record Version, you should put a synchronization argument 'ack(...)' in head goal as followings.

queen_record([P|U],C,L,I,PE,ack(Log))+GSig+Raise-SC :-
Raise = {Raise1,Raise2,Raise3,Raise4},
Log = c1(Log1,Log2,Log3),
TO:= (P mod PE)+1,
throw_record(U,[P|C],L,I2,PE,TO,Log1)+GSig+Raise1-SC,
merge_record(I1,I2,I,Log2)+GSig+Raise2-SC,
append_record(U,C,N)+GSig+Raise3-SC,
c1_record(P,1,N,L,L,I1,PE,Log3)+GSig+Raise4-SC.
queen_record([],[_|_],_,I,ack(Log))+Sig+Raise-SC:-
Raise=[],
Log=c2(Ack),
I=[].

(4) In Record Version, you should put a synchronization goal 'output(...)' for body-unification as followings.

queen_record([],[_|_],_,I,ack(Log))+Sig+Raise-SC:-
Raise=[],
Log=c2(Ack),
output(Ack,I, [])-SC.
output(Ack,X,Y)-SC :- wait(Ack) | X = Y.

(5) In Replay Version, you should put a clause for interruption check.

% Replay Version
queen_replay(A, B, Log)+GSig+Raise-SC :-
(wait(Log) -> queen_replay_0(A, B, Log)+GSig+Raise-SC ;
alternatively;
GSig = [rebirth|GSig1] -> queen_record(A, B, _)+GSig1+Raise-SC).
queen_replay_0(N,X,Log)+GSig+Raise-SC :-
current_node(_,All),
queen_0_replay(N,X,~(All-1),Log)+GSig+Raise-SC @node(1).
queen_0_replay( 4,X,A,Log)+GSig+Raise-SC :-
queen_replay([1,2,3,4],[],[],X,A,Log)+GSig+Raise-SC.
....
queen_record([P|U],C,L,I,PE,Log)+GSig+Raise-SC :-
Log = c1(Log1,Log2,Log3) |
Raise = {Raise1,Raise2,Raise3,Raise4},
TO:= (P mod PE)+1,
throw_replay(U,[P|C],L,I2,PE,TO,Log1)+GSig+Raise1-SC,
merge_replay(I1,I2,I,Log2)+GSig+Raise2-SC,
append_replay(U,C,N)+GSig+Raise3-SC,
c1_replay(P,1,N,L,L,I1,PE,Log3)+GSig+Raise4-SC.
queen_replay([],[_|_],_,I,ack(Log))+Sig+Raise-SC:-
Log=c2(Ack) |
Raise=[],
I=[].

(6) You should change throwing goals ('goal @ node(N)') into unification goals as followings.

throw_record(A,B,C,D,E,F,Log)+GSig+Raise-SC :-
Raise = [goal(primary,queen,queen,{A,B,C,D,E},Log)].
throw_replay(A,B,C,D,E,F,Log)+GSig+Raise-SC :-
Raise = [goal(primary,queen,queen,{A,B,C,D,E},Log)].
In this version, goals are thrown to the neighboring node, so you need not add destination node number.

(7) You should add clauses for throwing goals in module 'exgoal'.

call_goal_0(primary,queen,queen,{A,B,C,D,E},Log,GSig,Raise)-SC :-
queen:queen_record(A,B,C,D,E,Log)+GSig+Raise-SC.
call_goal_0(backup ,queen,queen,{A,B,C,D,E},Log,GSig,Raise)-SC :-
queen:queen_replay(A,B,C,D,E,Log)+GSig+Raise-SC.
....
otherwise.
call_goal_0(Type,Module,Method,Arguments,Log,GSignal,Raise)-SC :-
klicio:klicio([stdout(normal(Out))]),
variable:wrap((Type::Module:Method/Arguments), G),
Out = [fwrite("Illegal goal invocation : "), putwt(G), nl,fflush(_)],
Raise = [].
Attention) You can get "ft_queen.kl1" after some optimization.
sugino@jaist.ac.jp