For the stated goal the multi-threading has to be pre-emptive.
Simple Forths have a PAUSE-ing task-loop that runs tasks
one after the other, never overlapping. That is surprisingly
useful, but not in this case, because tasks that never overlap
cannot keep more than one core busy.
Modern, professional Forths can do multi-threading, but I
know of only one with special primitives to make it easier.
The example matrix multiplication given earlier is not a
demonstration of multi-threading.
To my knowledge (*), only the iForth compiler has
special multi-threading primitives (OCCAM based),
and comes with examples that really run x-times faster
on n-core processors (where x < n). For the matrix
code I would use its PAR .. ENDPAR where the threads
access rows and columns that stay far apart in memory,
to prevent cache pollution. There is another primitive
that automatically splits up DO-LOOPs for you, in the
way needed for this task.
An example of this syntax for 8 threads, each handling one of 8 consecutive rows (so /rsz must be a multiple of 8), is:
    0 VALUE jj
    : mmul2 ( F: -- r )
        a3 /size DFLOATS ERASE
        /rsz 0 DO
            I TO jj
            PAR
                STARTP /rsz 0 DO a1 jj /rsz * I + DFLOAT[] DF@ a2 I /rsz * DFLOAT[] a3 jj /rsz * DFLOAT[] /rsz DAXPY_sse2 LOOP ENDP
                STARTP /rsz 0 DO a1 jj 1+ /rsz * I + DFLOAT[] DF@ a2 I /rsz * DFLOAT[] a3 jj 1+ /rsz * DFLOAT[] /rsz DAXPY_sse2 LOOP ENDP
                STARTP /rsz 0 DO a1 jj 2+ /rsz * I + DFLOAT[] DF@ a2 I /rsz * DFLOAT[] a3 jj 2+ /rsz * DFLOAT[] /rsz DAXPY_sse2 LOOP ENDP
                STARTP /rsz 0 DO a1 jj 3 + /rsz * I + DFLOAT[] DF@ a2 I /rsz * DFLOAT[] a3 jj 3 + /rsz * DFLOAT[] /rsz DAXPY_sse2 LOOP ENDP
                STARTP /rsz 0 DO a1 jj 4 + /rsz * I + DFLOAT[] DF@ a2 I /rsz * DFLOAT[] a3 jj 4 + /rsz * DFLOAT[] /rsz DAXPY_sse2 LOOP ENDP
                STARTP /rsz 0 DO a1 jj 5 + /rsz * I + DFLOAT[] DF@ a2 I /rsz * DFLOAT[] a3 jj 5 + /rsz * DFLOAT[] /rsz DAXPY_sse2 LOOP ENDP
                STARTP /rsz 0 DO a1 jj 6 + /rsz * I + DFLOAT[] DF@ a2 I /rsz * DFLOAT[] a3 jj 6 + /rsz * DFLOAT[] /rsz DAXPY_sse2 LOOP ENDP
                STARTP /rsz 0 DO a1 jj 7 + /rsz * I + DFLOAT[] DF@ a2 I /rsz * DFLOAT[] a3 jj 7 + /rsz * DFLOAT[] /rsz DAXPY_sse2 LOOP ENDP
            ENDPAR
        8 +LOOP
        0e a3 /size 0 ?DO DF@+ F+ LOOP DROP ;
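To make clear what each STARTP .. ENDP block computes, here is a minimal single-threaded sketch in plain Forth. It assumes a system with the floating-point word sets and a separate FP stack. The names row-axpy, mmul-seq and fill1 are hypothetical. The definitions of /rsz, /size, a1, a2, a3, DFLOAT[], DF@+ and DAXPY are stand-ins with stack effects inferred from the code above, not iForth's own words. DAXPY stands in for DAXPY_sse2 and is assumed to do the usual y := y + r*x.

```forth
\ Stand-ins (assumptions): tiny matrices and portable helper words.
4 CONSTANT /rsz                          \ rows = columns, kept small here
/rsz /rsz * CONSTANT /size               \ elements per matrix
DFALIGN HERE /size DFLOATS ALLOT CONSTANT a1
DFALIGN HERE /size DFLOATS ALLOT CONSTANT a2
DFALIGN HERE /size DFLOATS ALLOT CONSTANT a3

: DFLOAT[] ( base index -- addr )  DFLOATS + ;
: DF@+ ( addr -- addr' ) ( F: -- r )  DUP DF@ DFLOAT+ ;

\ Assumed semantics of the AXPY kernel: y[i] := y[i] + r * x[i]
: DAXPY ( x y n -- ) ( F: r -- )
    0 ?DO
        OVER I DFLOAT[] DF@ FOVER F*     \ F: r r*x[i]
        DUP  I DFLOAT[] DUP DF@ F+ DF!   \ y[i] updated, F: r
    LOOP 2DROP FDROP ;

\ The work of one STARTP .. ENDP block: accumulate one row of a3.
: row-axpy ( row -- )
    /rsz 0 DO
        a1 OVER /rsz * I + DFLOAT[] DF@  \ F: a1[row,I]
        a2 I /rsz * DFLOAT[]             \ start of row I of a2
        a3 2 PICK /rsz * DFLOAT[]        \ start of row `row` of a3
        /rsz DAXPY
    LOOP DROP ;

\ Same result as mmul2, but every row is handled in turn by one thread.
: mmul-seq ( F: -- r )
    a3 /size DFLOATS ERASE
    /rsz 0 DO I row-axpy LOOP
    0e a3 /size 0 ?DO DF@+ F+ LOOP DROP ;

\ Fill a matrix with ones, for a quick check.
: fill1 ( addr -- )  /size 0 DO 1e DUP I DFLOAT[] DF! LOOP DROP ;
```

A quick check would be `a1 fill1 a2 fill1 mmul-seq F.` at the prompt. Each call to row-axpy writes only to its own row of a3, which is why the eight blocks in mmul2 can run side by side without locking.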
For 1024 x 1024 matrices this (mmul2) is about 3.5 times faster than the single-thread version (mmul1): 0.114 versus 0.400 seconds in the run below.
    FORTH> TESTS
    DOT/AXPY using 64 bits floats.
    Vector size = 1048576
    mul0  (dot)         : 6.8719411200000000000e+0013 0.133 seconds elapsed.
    mul1  (dot_sse2)    : 6.8719411200000000000e+0013 0.106 seconds elapsed.
    mmul0 (axpy)        : 5.6294941655040000004e+0014 0.981 seconds elapsed.
    mmul1 (axpy_sse2)   : 5.6294941655040000004e+0014 0.400 seconds elapsed.
    mmul2 (Paxpy_sse2)  : 5.6294941655040000004e+0014 0.114 seconds elapsed. ok
(*) Rumor has it that MPE and Forth Inc recently added
similar functionality.