I would like to achieve 0.5-1 million remote function calls per second. Let's assume we have one Central computer where the computation starts and one Worker computer which does the computation (in the real configuration there will be many Worker computers). Let's assume for a moment that our task is to compute the sum of [(random int from 0 to MAX_VAL)*2], PROBLEM_SIZE times.

The very naive prototype is:
Worker:

    // The real function takes 0.070 ms to compute; doubling the input stands in for it here.
    int compute(int input) {
        return input * 2;
    }

    void go() {
        try {
            ServerSocket ss = new ServerSocket(socketNum);
            System.out.println("Listening on " + socketNum);
            Socket s = ss.accept();
            DataInput di = new DataInputStream(s.getInputStream());
            OutputStream os = s.getOutputStream();
            byte[] arr = new byte[4];
            ByteBuffer wrap = ByteBuffer.wrap(arr);
            for (;;) {
                // Read one 4-byte request, compute, write the 4-byte response back.
                wrap.clear();
                di.readFully(arr);
                int value = wrap.getInt();
                int output = compute(value);
                wrap.clear();
                os.write(wrap.putInt(output).array());
            }
        } catch (IOException e) {
            System.err.println("Exception at " + socketNum);
            e.printStackTrace();
        }
    }
Central:

    void go() {
        try {
            Socket s = new Socket(ip, socketNum);
            s.setSoTimeout(2000);
            OutputStream os = s.getOutputStream();
            DataInput di = new DataInputStream(s.getInputStream());
            System.out.println("Central socket starting for " + socketNum);
            Random r = new Random();
            byte[] buf = new byte[4];
            ByteBuffer wrap = ByteBuffer.wrap(buf);
            long start = System.currentTimeMillis();
            long sum = 0;
            // n is the number of calls made over this connection (CHUNK_SIZE below).
            for (int i = 0; i < n; i++) {
                // Send one 4-byte request and block until the 4-byte response arrives.
                wrap.clear();
                int value = r.nextInt(10000);
                os.write(wrap.putInt(value).array());
                di.readFully(buf);
                wrap.clear();
                sum += wrap.getInt();
            }
            System.out.println(n + " calls in " + (System.currentTimeMillis() - start) + " ms");
        } catch (SocketTimeoutException ste) {
            System.err.println("Socket timeout at " + socketNum);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
If the ping is 0.150ms and we run a 1-threaded Worker and a 1-threaded Central, each iteration will take ~0.150ms. To improve performance, I run N threads on both Worker and Central; the n-th thread listens on port 2000+n. After each thread stops, we sum up the results. A rough sketch of the driver is shown below.
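Roughly, the Central driver looks like this (a minimal sketch; the class name, the command-line arguments, and the use of DataInputStream/DataOutputStream instead of the raw ByteBuffer writes are my simplifications for illustration):

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;

public class CentralDriver {
    static final int BASE_PORT = 2000;      // n-th thread talks to port 2000 + n
    static final int CHUNK_SIZE = 100_000;  // calls per connection, as in the benchmarks

    public static void main(String[] args) throws InterruptedException {
        String workerIp = args[0];
        int threadCount = Integer.parseInt(args[1]);   // the N from the benchmarks
        AtomicLong totalSum = new AtomicLong();
        List<Thread> threads = new ArrayList<>();
        long start = System.currentTimeMillis();

        for (int n = 0; n < threadCount; n++) {
            final int port = BASE_PORT + n;
            Thread t = new Thread(() -> {
                try (Socket s = new Socket(workerIp, port)) {
                    s.setSoTimeout(2000);
                    DataOutputStream out = new DataOutputStream(s.getOutputStream());
                    DataInputStream in = new DataInputStream(s.getInputStream());
                    Random r = new Random();
                    long sum = 0;
                    for (int i = 0; i < CHUNK_SIZE; i++) {
                        out.writeInt(r.nextInt(10000));  // one blocking request...
                        sum += in.readInt();             // ...and its response
                    }
                    totalSum.addAndGet(sum);
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
            t.start();
            threads.add(t);
        }
        for (Thread t : threads) {
            t.join();   // wait for all connections, then report the combined result
        }
        System.out.println((long) threadCount * CHUNK_SIZE + " calls in "
                + (System.currentTimeMillis() - start) + " ms, sum=" + totalSum.get());
    }
}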
Benchmarks
First, I ran the program above on my fellow's school network. Second, I ran it on two Amazon EC2 cluster instances. The gap in results was very big.
CHUNK_SIZE = 100_000 in all runs.
Fellow's network:
I think it was the top configuration available 3 years ago (Xeon E5645). I believe it is heavily optimized for parallel computations and has a simple LAN topology, since it has only 20 machines.
OS: Ubuntu
Average ping: ~0.165ms
N=1 total time=6 seconds
N=10 total time=9 seconds
N=20 total time=11 seconds
N=32 total time=14 seconds
N=100 total time=21 seconds
N=500 total time=54 seconds
Amazon network:
I ran the program on two Cluster Compute Eight Extra Large instances (cc2.8xlarge) started in the same Placement Group.
OS: some Amazon Linux flavor
Average ping: ~0.170ms
The results were a bit disappointing:
N=1 total time=16 seconds
N=10 total time=36 seconds
N=20 total time=55 seconds
N=32 total time=82 seconds
N=100 total time=250 seconds
N=500 total time=1200 seconds
I ran each configuration 2-4 times; the results were similar, mostly within ±5%.
The Amazon N=1 result makes sense, since 0.170ms per function call ≈ 6000 calls per second ≈ 100_000 calls per 16 seconds. The 6 seconds for my fellow's network are actually surprising.
I think the maximum TCP packet rate on modern networks is around 40-70k packets per second. This matches the N=100, time=250 seconds run: N * CHUNK_SIZE / time = 100 * 100_000 packets / 250 sec = 10_000_000 packets / 250 sec = 40_000 packets/second.
The question is: how did my fellow's network/computer configuration manage to do so well, especially with high N values?
My guess: it is wasteful to put each 4-byte request and 4-byte response into an individual packet, since there is ~40 bytes of overhead per packet. It would be wise to pool all these tiny requests for, say, 0.010ms, put them into one big packet, and then redistribute the requests to the corresponding sockets. It is possible to implement such pooling at the application level (a sketch follows below), but it seems that my fellow's network/OS is configured to do something like this already.
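For illustration, application-level pooling on one connection could look roughly like this (BATCH_SIZE, the class name, and the buffered streams are made up for the sketch; the Worker above would keep working unchanged, since it still reads and answers one int at a time):

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.Socket;
import java.util.Random;

public class BatchingCentral {
    static final int BATCH_SIZE = 256;      // tiny requests pooled into a single write
    static final int CHUNK_SIZE = 100_000;  // total calls per connection

    public static void main(String[] args) throws IOException {
        try (Socket s = new Socket(args[0], Integer.parseInt(args[1]))) {
            DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(s.getOutputStream()));
            DataInputStream in = new DataInputStream(
                    new BufferedInputStream(s.getInputStream()));
            Random r = new Random();
            long sum = 0;
            long start = System.currentTimeMillis();
            for (int done = 0; done < CHUNK_SIZE; done += BATCH_SIZE) {
                int k = Math.min(BATCH_SIZE, CHUNK_SIZE - done);
                for (int i = 0; i < k; i++) {
                    out.writeInt(r.nextInt(10000));  // queue k requests in the buffer
                }
                out.flush();                         // they leave as one (or a few) packets
                for (int i = 0; i < k; i++) {
                    sum += in.readInt();             // read the k responses back in order
                }
            }
            System.out.println(CHUNK_SIZE + " calls in "
                    + (System.currentTimeMillis() - start) + " ms, sum=" + sum);
        }
    }
}

For the responses to be pooled as well, the Worker would need a similar buffered write-and-flush scheme on its side.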
Update: I've played with java.net.Socket#setTcpNoDelay(); it didn't change anything.
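For reference, that is just the following call, made on the Central socket (and the socket returned by accept() on the Worker can be configured the same way):

    Socket s = new Socket(ip, socketNum);
    s.setTcpNoDelay(true);   // disable Nagle's algorithm so small writes are not held back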
The ultimate goal: I approximate an equation with millions of variables using a very large tree. Currently, a tree with 200_000 nodes fits in RAM, but I am interested in approximating an equation which requires a tree with millions of nodes; that would take a few terabytes of RAM. The basic idea of the algorithm is to take a random path from a node down to a leaf and improve the values along it (a rough sketch of one iteration follows). Currently the program is 32-threaded and each thread does 15000 iterations per second. I would like to move it to a cluster while keeping the same number of iterations per second.
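One iteration looks roughly like this (the Node interface and improve() are placeholders standing in for my real data structure, shown only to illustrate the random-path idea):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.ThreadLocalRandom;

interface Node {
    int childCount();   // 0 for a leaf
    Node child(int i);
    void improve();     // update this node's value
}

class RandomPathIteration {
    static void iterate(Node start) {
        // Walk a random path from the starting node down to a leaf, remembering it.
        Deque<Node> path = new ArrayDeque<>();
        for (Node cur = start; cur != null; ) {
            path.push(cur);
            int k = cur.childCount();
            cur = (k == 0) ? null : cur.child(ThreadLocalRandom.current().nextInt(k));
        }
        // Improve the values along the sampled path, leaf first.
        while (!path.isEmpty()) {
            path.pop().improve();
        }
    }
}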