Erlang takes ages to count?

Question

Hi guys, I'm trying to make a load generator, and my goal is to compare how much of my system's resources are consumed when spawning Erlang processes versus spawning Java threads. I am doing this by having the program count to 1000000000 ten times. Java takes roughly 35 seconds to finish the whole run with 10 threads. Erlang takes ages with 10 processes; I grew impatient after it had spent over 4 minutes counting.

If I just make Erlang and Java count to 1000000000 without spawning threads or processes, Erlang takes 1 minute and 32 seconds and Java takes about 3 seconds. I know Erlang is not made for crunching numbers, but a difference that large is alarming. Why is there such a big difference? Both use 100% of my CPU, with no spike in RAM.

I am not sure what other methods could be used to make this comparison, so I am open to any suggestions as well.

Here is the code for both versions. Erlang first:

-module(loop).

-compile(export_all).


start(NumberOfProcesses) ->
    loop(0, NumberOfProcesses).

%% Processes to spawn
loop(A, NumberOfProcesses) ->
    if
        A < NumberOfProcesses ->
            spawn(loop, outerCount, [0]),
            loop(A+1, NumberOfProcesses);
        true ->
            ok
    end.

%% Outer loop
outerCount(A) ->
    if
        A < 10 ->
            innerCount(0),
            outerCount(A + 1);
        true ->
            ok
    end.

%% Inner loop
innerCount(A) ->
    if
        A < 1000000000 ->
            innerCount(A+1);
        true ->
            ok
    end.

And the Java version:

import java.util.Scanner;

class Loop implements Runnable
{
    public static void main(String[] args)
    {
        System.out.println("Input number of processes");
        Scanner scan = new Scanner(System.in);
        String theNumber = scan.nextLine();

        for (int t = 0; t < Integer.parseInt(theNumber); t++)
        {
            new Thread(new Loop()).start();
        }
    }

    public void run()
    {
        int i;
        for (i = 0; i < 10; i++)
        {
            for (int j = 0; j < 1000000000; j++);
        }
    }
}

On the question of compiler optimization: if I add a System.out.print() call in that Java loop, it takes at least 10 minutes to run on my 2009 iMac. I gave up at that point (91,545,633 loop iterations). – macintux Mar 16 '14 at 19:30
Doesn't writing to the output stream slow things down a bit? I've noticed that when a program produces output, its execution time is affected a little. – Question Mcquestioner Mar 16 '14 at 21:23
Yes, I/O definitely slows things down; the key point is that 3 seconds, or 35 seconds, isn't an accurate measure of the Java execution, just of a smart compiler. – macintux Mar 16 '14 at 23:21
http://stackoverflow.com/a/6967420/3346496 is a very good answer to look at if you are interested in the performance of the Erlang code. – monocell Mar 17 '14 at 03:13

score 2 · Answer 1 · answered Mar 16 '14 at 19:22

Are you running a 32- or 64-bit version of Erlang? If it's 32-bit, the inner loop limit of 1000000000 won't fit in a single-word fixnum (max 28 bits including sign), so the loop will start doing bignum arithmetic on the heap. That is far more expensive than incrementing a word and looping, and it also causes garbage collection now and then to clear old, unused numbers off the heap. Changing the outer loop from 10 to 1000 and correspondingly removing two zeros from the inner loop limit should make it use fixnum arithmetic only, even on a 32-bit BEAM.
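
The arithmetic behind that suggestion can be checked with a small sketch. It takes the 28-bit figure from the answer as given; the class and method names (FixnumRange, maxSigned, fitsInFixnum) are made up for illustration and are not part of any Erlang or Java API.

```java
class FixnumRange {
    // Assumption taken from the answer: a 32-bit BEAM fixnum has 28 bits including sign.
    static final int FIXNUM_BITS = 28;

    // Largest value of a signed two's-complement integer with the given width.
    static long maxSigned(int bits) {
        return (1L << (bits - 1)) - 1;
    }

    // True if n lies within the signed range of FIXNUM_BITS bits.
    static boolean fitsInFixnum(long n) {
        long min = -(1L << (FIXNUM_BITS - 1));
        return n >= min && n <= maxSigned(FIXNUM_BITS);
    }

    public static void main(String[] args) {
        System.out.println("max 28-bit signed value: " + maxSigned(FIXNUM_BITS));
        System.out.println("1000000000 fits: " + fitsInFixnum(1_000_000_000L));
        System.out.println("10000000 fits: " + fitsInFixnum(10_000_000L));
    }
}
```

Under that assumption the largest fixnum is 134217727, so the original limit of 1000000000 overflows it, while the reduced limit of 10000000 stays well inside.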

Then, it's also a question of whether the Java version is actually doing any work at all, or if the loop gets optimized away to a no-op at some point. (The Erlang compiler doesn't do that sort of trick - at least not yet.)

I am running a 32 bit version of Erlang. – Question Mcquestioner Mar 17 '14 at 12:24

score 2 · Answer 2 · answered Mar 16 '14 at 22:14

RichardC's answer gives some clues to the difference in execution time. I will add that if your Java code is compiled, it may benefit a lot from the processor's branch prediction, and thus make better use of the cache memories.

But the more important point, in my opinion, is that you are not choosing the right ratio of processes to processing to evaluate the cost of process spawning.

The test uses 10 processes that each do significant work. I would have chosen a test where many processes are spawned (some thousands? I don't know how many threads the JVM can manage), each doing very little. For example, in the code below every process spawns two children at each step, and the caller waits for the deepest processes to send back the message done. With a depth of 17, which means 262143 processes in total and 131072 returned messages, it takes less than 0.5 s on my very slow PC, that is, less than 2 µs per process (of course, the dual-core, dual-thread CPU should be used).

-module (cascade).

-compile([export_all]).


test() ->
    timer:tc(?MODULE,start,[]).

start() ->
    spawn(?MODULE,child,[self(),17]),
    loop(1024*128).

loop(0) -> done;
loop(N) ->
    receive
        done -> loop(N-1)
    end.

child(P,0) -> P ! done;
child(P,N) ->
    spawn(?MODULE,child,[P,N-1]),
    spawn(?MODULE,child,[P,N-1]).

Pascal

I think the code you wrote just now is what I may be looking for; would you mind if I used it? I have a few questions about it, though. Why 1024*128? Why not just write 131072? And how do 262143 processes return 131072 messages? Also, shouldn't it be 262142? I calculated this and it comes to that. "Of course the dual core dual thread should be used": meaning? I am a beginner at most of this, so please forgive my ignorance. – Question Mcquestioner Mar 17 '14 at 12:11
Another question, is there a way I can emulate the same thing in java ? I want to have an equal test for both. – Question Mcquestioner Mar 17 '14 at 12:16
I figured out why it's 131072, never mind the question. – Question Mcquestioner Mar 17 '14 at 12:23
No problem using this code :o). The remark about dual core and dual thread is just a warning: on my PC the program took a little less than 0.5 seconds, but since the VM uses all the available threads and shares the bandwidth with many other Windows processes (at least 2 PDFs, 4 docs, Firefox, Excel...), the real computation time is something between 0 and 2 seconds. And sorry, the only thing I have coded in Java is a calculator, so I have nothing to add there. – Pascal Mar 17 '14 at 13:06
Thank you for your input and permission. I will try to emulate this in java to see what I get. Creating a tree of threads shouldn't be that hard (I hope). – Question Mcquestioner Mar 17 '14 at 13:56
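
A rough Java counterpart of the cascade test might look like the sketch below. It is an assumption-laden illustration, not a measured equivalent: the class Cascade and its methods are hypothetical names, a CountDownLatch stands in for the done messages, and the depth is set to 10 because platform threads are much heavier than Erlang processes, so a depth of 17 may be slow or fail outright depending on the JVM and OS limits. It also spells out the counts: a tree of depth d has 2^d leaves and 2^(d+1) - 1 nodes in total, which gives 131072 and 262143 for d = 17.

```java
import java.util.concurrent.CountDownLatch;

class Cascade {
    // Number of leaf threads in a binary tree of the given depth: 2^depth.
    static int leaves(int depth) {
        return 1 << depth;
    }

    // Total threads in the tree, root included: 2^(depth+1) - 1.
    static long totalThreads(int depth) {
        return (1L << (depth + 1)) - 1;
    }

    // Inner nodes start two children; leaves report completion on the latch,
    // playing the role of the `done` message in the Erlang version.
    static void child(CountDownLatch done, int depth) {
        if (depth == 0) {
            done.countDown();
            return;
        }
        new Thread(() -> child(done, depth - 1)).start();
        new Thread(() -> child(done, depth - 1)).start();
    }

    // Builds the whole tree, waits for every leaf, returns elapsed milliseconds.
    static long run(int depth) {
        CountDownLatch done = new CountDownLatch(leaves(depth));
        long start = System.nanoTime();
        new Thread(() -> child(done, depth)).start();
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for leaves", e);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        int depth = 10; // assumption: much shallower than the Erlang depth of 17
        System.out.println(totalThreads(depth) + " threads, " + leaves(depth)
                + " leaves, " + run(depth) + " ms");
    }
}
```

Dividing the elapsed time by totalThreads(depth) gives a per-thread figure comparable in spirit to the per-process figure quoted for Erlang, although the two runs only match if the same depth is used on both sides.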

macintux · Answer 3 · 2014-03-16T19:51:02.440

There are a few problems here.

I don't know how you can evaluate what the Java compiler is doing, but I'd wager it's optimizing the loop out of existence. I think you'd have to have the loop do something meaningful to make any sort of comparison.
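
One way to give the loop something meaningful to do is to fold the counter into a value that is used afterwards, as in this minimal sketch. The class name CountingWork and the method countTo are invented for illustration, and a JIT compiler may still simplify such a loop, so the timing is only a somewhat fairer estimate, not a rigorous benchmark.

```java
class CountingWork {
    // Counts to `limit` and accumulates every step into a result the caller
    // consumes, which makes the loop harder to discard as dead code.
    static long countTo(int limit) {
        long sum = 0;
        for (int j = 0; j < limit; j++) {
            sum += j;
        }
        return sum;
    }

    public static void main(String[] args) {
        int limit = 1_000_000_000;
        long start = System.nanoTime();
        long result = countTo(limit);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        // Printing the result keeps it observable outside the loop.
        System.out.println("sum = " + result + ", took " + elapsedMs + " ms");
    }
}
```

The Erlang side would need the same change (carrying an accumulator through the recursion) for the two timings to measure comparable work.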

More importantly, the Erlang code is not doing what you think it's doing, as best as I can tell. It appears that each process is counting up to 1000000000, and then doing it again for a total of 10 times.

Perhaps worse, your functions are not tail recursive, so your functions keep accumulating in memory waiting for the last one to execute. (Edit: I may be wrong about that. Unaccustomed to the if statement.)

Here's Erlang that does what you want it to do. It's still very slow.

-module(realloop).
-compile(export_all).

start(N) ->
    loop(0, N).

loop(N, N) ->
    io:format("Spawned ~B processes~n", [N]);
loop(A, N) ->
    spawn(realloop, count, [0, 1000000000]),
    loop(A+1, N).

count(Upper, Upper) ->
    io:format("Reached ~B~n", [Upper]);
count(Lower, Upper) ->
    count(Lower+1, Upper).

Your code does seem shorter and better in general. I am not a pro at Erlang; I just started about a month ago. Yes, I want it to count to 1000000000 10 times, just like the Java code does. I thought my functions would still count as tail recursive if the recursive call is the last one made, regardless of whether it's in an if or not. – Question Mcquestioner Mar 16 '14 at 21:31