
Sorry, I'm an Erlang newbie and may be asking a stupid question, but please help me solve this issue.

I have written an Erlang server to replace the Node.js one I'm currently using, which ate all my memory; I'm praying that Erlang will be a way out. The server works properly under unit tests and internal testing, but shows high CPU usage in a stress test.

After trimming the code down, I found that the CPU burst comes from the TCP receive from clients.

receiveClientPacket(Sock) ->
    %% Re-arm the socket for exactly one incoming message and set the buffer size
    inet:setopts(Sock, [{active, once}, {buffer, ?CLIENTHEARTBEATSIZE}]),
    receive
        {tcp, Sock, Data} ->
            {ok, Data};
        {tcp_closed, Sock} ->
            {error, closed}
    after ?CLIENTRECCEIVETIMEOUT ->
        {error, timeout}
    end.
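One small variation that might be worth measuring (a sketch, not a confirmed fix): since the buffer size presumably never changes for a given connection, `{buffer, ?CLIENTHEARTBEATSIZE}` could be set once when the socket is accepted, so the per-message `inet:setopts/2` call only has to re-arm `{active, once}`. The `init_client_socket/1` name below is made up for the sketch:

```erlang
%% Sketch: set the fixed receive buffer once, right after accept,
%% instead of on every loop iteration.
init_client_socket(Sock) ->
    ok = inet:setopts(Sock, [{buffer, ?CLIENTHEARTBEATSIZE}]).

receiveClientPacket(Sock) ->
    %% Only re-arm delivery of the next single message here.
    ok = inet:setopts(Sock, [{active, once}]),
    receive
        {tcp, Sock, Data} ->
            {ok, Data};
        {tcp_closed, Sock} ->
            {error, closed}
    after ?CLIENTRECCEIVETIMEOUT ->
        {error, timeout}
    end.
```

Whether this actually moves the %sy figure is something to verify with the stress test; it only removes one redundant option per received message.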

I tried making the process sleep for 10 hours at the beginning of the function (so that it never calls receive), and the CPU didn't burst at all. Therefore I conclude that the CPU burst is due to the TCP receive. (Please correct me if I've made a mistake.)

Here is some information about my stress test:

  1. I start the Erlang server with: erl +zdbbl 2097151 -K true +A 128 +P 5000000
  2. I connect 5000 clients to the Erlang server.
  3. Each connected client sends 2 bytes of data to the server every 1 minute.
  4. After all the connections are established (i.e. only the 2-byte messages per minute are flowing), the CPU bursts to ~30% sy (from "top").

I'm using an Amazon Linux AMI (large instance, 64-bit) for the Erlang server. Is the burst caused by Linux? I have no idea how the system could use up that much CPU. Or is it a problem with my poor code? (I believe it is...)

In the real situation, our servers don't only receive ping/pong but also actual messages, which is a lot more load... This is only the first step...

Millions of thanks to anyone who can save me.

Anita~*

~~~~~~~~~~~~~~~~~~~~~~~

Information about large instance (for reference):

  • 7.5 GB memory
  • 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
  • 850 GB instance storage
  • 64-bit platform
  • I/O Performance: High
    Is there one single process running the code above? – Tilman Oct 11 '12 at 05:08
  • I add a new process for each incoming connection, therefore for each connection, i'll keep looping on the function on its process – Anita Oct 11 '12 at 06:10
  • One thing to keep in mind is that if you stop the TCP reception, you also stop any processing going on afterwards. But since you are spending system time and not userland time, it seems to be the kernel which is on work here. You should probably hunt for why that is the case. – I GIVE CRAP ANSWERS Oct 11 '12 at 09:35
  • Setting the maximum process count to 5000000 processes (`+P 5000000`) is unnecessary in your case, it will increase process size but not affect CPU. – rvirding Oct 11 '12 at 13:25
  • This means you are receiving about 85 2-byte messages per sec, which is not much. Are you doing anything else in the loop code apart from calling `receiveClientPacket/1`? – rvirding Oct 11 '12 at 13:26
  • rvirding: no, i've already trimmed the program to call this only – Anita Oct 11 '12 at 14:01
  • I'm thinking if it is the OS problem??? – Anita Oct 11 '12 at 14:13
  • Hello, did you look at the clients, you don't explain where and how they are implemented (even if the 10 hours nap may be there to test this). – Pascal Oct 16 '12 at 13:14
  • Maybe there are other, not matching messages on the queue? What's the size of receiving process queue? Isn't it a selective receive problem by any chance? – Wacław Borowiec Dec 25 '12 at 20:33

1 Answer


This article about building a non-blocking TCP server using OTP principles might be an interesting read for you. You could also have a look at Ranch, which is used by Cowboy, the Erlang HTTP server that supports a very large number of connections while keeping a low memory footprint.
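As a rough illustration of the `{active, once}` acceptor pattern that such OTP-style servers use (the module and function names here are made up for the sketch, and error handling is omitted), a per-connection handler might look like:

```erlang
-module(sketch_handler).
-export([acceptor/1]).

%% Sketch: one lightweight process per accepted socket. The socket is
%% re-armed with {active, once} before each message, so the process is
%% only scheduled when data actually arrives.
acceptor(ListenSock) ->
    {ok, Sock} = gen_tcp:accept(ListenSock),
    %% Spawn the next acceptor immediately so accepting never blocks handling.
    spawn(fun() -> acceptor(ListenSock) end),
    loop(Sock).

loop(Sock) ->
    ok = inet:setopts(Sock, [{active, once}]),
    receive
        {tcp, Sock, Data} ->
            gen_tcp:send(Sock, Data),   % echo back, just for the sketch
            loop(Sock);
        {tcp_closed, Sock} ->
            ok
    end.
```

A real server built on Ranch or the article's design wraps this loop in a supervised OTP process rather than a bare `spawn`, but the receive discipline is the same.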

Tilman
  • I've just tried running the stress test on the program in the article, however, the CPU also burst to 100%. I think it's either the OS or configuration problem... – Anita Oct 12 '12 at 04:55