I would like to achieve 0.5-1 million remote function calls per second. Let's assume we have one Central computer where the computation starts and one Worker computer which does the computation (in the real configuration there will be many Worker computers). Let's assume for a moment that our task is to compute the sum of [(random int from 0 to MAX_VAL)*2], PROBLEM_SIZE times.

The very naive prototype is:
Worker:

    // The real function takes 0.070 ms to compute; doubling the input stands in for it here.
    int compute(int input) {
        return input * 2;
    }

    void go() {
        try {
            ServerSocket ss = new ServerSocket(socketNum);
            System.out.println("Listening on " + socketNum);
            Socket s = ss.accept();
            DataInput di = new DataInputStream(s.getInputStream());
            OutputStream os = s.getOutputStream();
            byte[] arr = new byte[4];
            ByteBuffer wrap = ByteBuffer.wrap(arr);
            for (;;) {
                // Read one 4-byte request, compute, write the 4-byte response back.
                wrap.clear();
                di.readFully(arr);
                int value = wrap.getInt();
                int output = compute(value);
                wrap.clear();
                os.write(wrap.putInt(output).array());
            }
        } catch (IOException e) {
            System.err.println("Exception at " + socketNum);
            e.printStackTrace();
        }
    }
Central:

    void go() {
        try {
            Socket s = new Socket(ip, socketNum);
            s.setSoTimeout(2000);
            OutputStream os = s.getOutputStream();
            DataInput di = new DataInputStream(s.getInputStream());
            System.out.println("Central socket starting for " + socketNum);
            Random r = new Random();
            byte[] buf = new byte[4];
            ByteBuffer wrap = ByteBuffer.wrap(buf);
            long start = System.currentTimeMillis();
            long sum = 0;
            // n is the number of calls made over this connection (CHUNK_SIZE below).
            for (int i = 0; i < n; i++) {
                // Send one 4-byte request and block until the 4-byte response arrives.
                wrap.clear();
                int value = r.nextInt(10000);
                os.write(wrap.putInt(value).array());
                di.readFully(buf);
                wrap.clear();
                sum += wrap.getInt();
            }
            System.out.println(n + " calls in " + (System.currentTimeMillis() - start) + " ms");
        } catch (SocketTimeoutException ste) {
            System.err.println("Socket timeout at " + socketNum);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
If the ping is 0.150ms and we run a 1-threaded Worker and a 1-threaded Central, each iteration will take ~0.150ms. To improve performance, I run N threads on both Worker and Central; the n-th thread listens on port 2000+n. After each thread stops, we sum up the results. A rough sketch of the driver is shown below.
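Roughly, the Central driver looks like this (a minimal sketch; the class name, the command-line arguments, and the use of DataInputStream/DataOutputStream instead of the raw ByteBuffer writes are my simplifications for illustration):

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;

public class CentralDriver {
    static final int BASE_PORT = 2000;      // n-th thread talks to port 2000 + n
    static final int CHUNK_SIZE = 100_000;  // calls per connection, as in the benchmarks

    public static void main(String[] args) throws InterruptedException {
        String workerIp = args[0];
        int threadCount = Integer.parseInt(args[1]);   // the N from the benchmarks
        AtomicLong totalSum = new AtomicLong();
        List<Thread> threads = new ArrayList<>();
        long start = System.currentTimeMillis();

        for (int n = 0; n < threadCount; n++) {
            final int port = BASE_PORT + n;
            Thread t = new Thread(() -> {
                try (Socket s = new Socket(workerIp, port)) {
                    s.setSoTimeout(2000);
                    DataOutputStream out = new DataOutputStream(s.getOutputStream());
                    DataInputStream in = new DataInputStream(s.getInputStream());
                    Random r = new Random();
                    long sum = 0;
                    for (int i = 0; i < CHUNK_SIZE; i++) {
                        out.writeInt(r.nextInt(10000));  // one blocking request...
                        sum += in.readInt();             // ...and its response
                    }
                    totalSum.addAndGet(sum);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
            t.start();
            threads.add(t);
        }
        for (Thread t : threads) {
            t.join();   // wait for all connections, then report the combined result
        }
        System.out.println((long) threadCount * CHUNK_SIZE + " calls in "
                + (System.currentTimeMillis() - start) + " ms, sum=" + totalSum.get());
    }
}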
Benchmarks
First, I ran the program above on my fellow's school network. Second, I ran it on two Amazon EC2 cluster instances. The gap in results was very big.
CHUNK_SIZE = 100_000 in all runs.
Fellow's network:
I think it was the top configuration available 3 years ago (Xeon E5645). I believe it is heavily optimized for parallel computations and has a simple LAN topology, since it has only 20 machines.
OS: Ubuntu
Average ping: ~0.165ms
N=1 total time=6 seconds
N=10 total time=9 seconds
N=20 total time=11 seconds
N=32 total time=14 seconds
N=100 total time=21 seconds
N=500 total time=54 seconds
Amazon network:
I ran the program on two Cluster Compute Eight Extra Large instances (cc2.8xlarge) started in the same Placement Group.
OS: some Amazon Linux flavor
Average ping: ~0.170ms
The results were a bit disappointing:
N=1 total time=16 seconds
N=10 total time=36 seconds
N=20 total time=55 seconds
N=32 total time=82 seconds
N=100 total time=250 seconds
N=500 total time=1200 seconds
I ran each configuration 2-4 times; the results were similar, mostly within ±5%.
The Amazon N=1 result makes sense, since 0.170ms per function call ≈ 6000 calls per second ≈ 100_000 calls per 16 seconds. The 6 seconds for my fellow's network are actually surprising.
I think the maximum TCP packet rate on modern networks is around 40-70k packets per second. This matches the N=100, time=250 seconds run: N * CHUNK_SIZE / time = 100 * 100_000 packets / 250 sec = 10_000_000 packets / 250 sec = 40_000 packets/second.
The question is: how did my fellow's network/computer configuration manage to do so well, especially with high N values?
My guess: it is wasteful to put each 4-byte request and 4-byte response into an individual packet, since there is ~40 bytes of overhead per packet. It would be wise to pool all these tiny requests for, say, 0.010ms, put them into one big packet, and then redistribute the requests to the corresponding sockets. It is possible to implement such pooling at the application level (a sketch follows below), but it seems that my fellow's network/OS is configured to do something like this already.
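For illustration, application-level pooling on one connection could look roughly like this (BATCH_SIZE, the class name, and the buffered streams are made up for the sketch; the Worker above would keep working unchanged, since it still reads and answers one int at a time):

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.Socket;
import java.util.Random;

public class BatchingCentral {
    static final int BATCH_SIZE = 256;      // tiny requests pooled into a single write
    static final int CHUNK_SIZE = 100_000;  // total calls per connection

    public static void main(String[] args) throws IOException {
        try (Socket s = new Socket(args[0], Integer.parseInt(args[1]))) {
            DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(s.getOutputStream()));
            DataInputStream in = new DataInputStream(
                    new BufferedInputStream(s.getInputStream()));
            Random r = new Random();
            long sum = 0;
            long start = System.currentTimeMillis();
            for (int done = 0; done < CHUNK_SIZE; done += BATCH_SIZE) {
                int k = Math.min(BATCH_SIZE, CHUNK_SIZE - done);
                for (int i = 0; i < k; i++) {
                    out.writeInt(r.nextInt(10000));  // queue k requests in the buffer
                }
                out.flush();                         // they leave as one (or a few) packets
                for (int i = 0; i < k; i++) {
                    sum += in.readInt();             // read the k responses back in order
                }
            }
            System.out.println(CHUNK_SIZE + " calls in "
                    + (System.currentTimeMillis() - start) + " ms, sum=" + sum);
        }
    }
}

For the responses to be pooled as well, the Worker would need a similar buffered write-and-flush scheme on its side.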
Update: I've played with java.net.Socket#setTcpNoDelay(); it didn't change anything.
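For reference, that is just the following call, made on the Central socket (and the socket returned by accept() on the Worker can be configured the same way):

    Socket s = new Socket(ip, socketNum);
    s.setTcpNoDelay(true);   // disable Nagle's algorithm so small writes are not held back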
The ultimate goal: I approximate an equation with millions of variables using a very large tree. Currently, a tree with 200_000 nodes fits in RAM, but I am interested in approximating an equation which requires a tree with millions of nodes; that would take a few terabytes of RAM. The basic idea of the algorithm is to take a random path from a node down to a leaf and improve the values along it (a rough sketch of one iteration follows). Currently the program is 32-threaded and each thread does 15000 iterations per second. I would like to move it to a cluster while keeping the same number of iterations per second.
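One iteration looks roughly like this (the Node interface and improve() are placeholders standing in for my real data structure, shown only to illustrate the random-path idea):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ThreadLocalRandom;

interface Node {
    int childCount();   // 0 for a leaf
    Node child(int i);
    void improve();     // update this node's value
}

class RandomPathIteration {
    static void iterate(Node start) {
        // Walk a random path from the starting node down to a leaf, remembering it.
        Deque<Node> path = new ArrayDeque<>();
        for (Node cur = start; cur != null; ) {
            path.push(cur);
            int k = cur.childCount();
            cur = (k == 0) ? null : cur.child(ThreadLocalRandom.current().nextInt(k));
        }
        // Improve the values along the sampled path, leaf first.
        while (!path.isEmpty()) {
            path.pop().improve();
        }
    }
}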