9

I have 100 servers in my cluster.

At time 17:35:00, all 100 servers are provided with data (of size 1[MB]). Each server processes the data, and produces an output of about 40[MB]. The processing time for each server is 5[sec].

At time 17:35:05 (5[sec] later), there's a need for a central machine to read all the output from all 100 servers (remember, the total size of data is: 100 [machines] x 40 [MB] ~ 4[GB]), aggregate it, and produce an output.

It is of high importance that the entire process of gathering the 4[GB] data from all 100 servers takes as little time as possible. How do I go about solving this problem?

Are there any existing tools (ideally, in python, but would consider other solutions) that can help?

Brian Tompsett - 汤莱恩
user3262424

5 Answers

5

Look at the flow of data in your application, and then look at the data rates that your (I assume shared) disk system provides and the rate your GigE interconnect provides, and the topology of your cluster. Which of these is a bottleneck?

GigE provides a theoretical maximum of 125 MB/s between nodes - thus moving 100 x 40 MB chunks (~4 GB) from the 100 processing nodes into your central node over GigE will take at least ~32 s, and in practice longer.
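
A quick back-of-envelope sketch of those transfer times (the 10 GigE and InfiniBand figures below are rough theoretical assumptions, not measurements of your hardware):

```python
# Back-of-envelope only: theoretical line rates, ignoring protocol overhead and disk.
TOTAL_MB = 100 * 40  # 100 nodes x 40 MB each, ~4 GB

for name, mb_per_s in [("GigE", 125), ("10 GigE", 1250), ("40 Gb/s InfiniBand", 4000)]:
    print(f"{name:>18}: {TOTAL_MB / mb_per_s:5.1f} s at best")
```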

A file system shared between all your nodes provides an alternative to RAM-to-RAM data transfer over Ethernet.

If your shared file system is fast at the disk read/write level (say, a bunch of many-disk RAID 0 or RAID 10 arrays aggregated into a Lustre file system or some such) and it uses a 20 Gb/s or 40 Gb/s interconnect between the block storage and the nodes, then 100 nodes each writing a 40 MB file to disk and the central node reading those 100 files may be faster than transferring the 100 x 40 MB chunks over the GigE node-to-node interconnect.

But if your shared file system is a RAID 5 or 6 array exported to the nodes via NFS over GigE, that will be slower than RAM-to-RAM transfer over GigE using RPC or MPI, because you have to write and then read the disks over GigE anyway.

So, there have been some good answers and discussion of your question. But we do (did) not know your node interconnect speed, and we do not know how your disk is set up (shared disk, or one disk per node), or whether the shared disk has its own interconnect and what speed that is.

Node interconnect speed is now known. It is no longer a free variable.

Disk set up (shared/not-shared) is unknown, thus a free variable.

Disk interconnect (assuming shared disk) is unknown, thus another free variable.

How much RAM your central node has is unknown (can it hold the 4 GB of data in RAM?), thus another free variable.

If everything including shared disk uses the same GigE interconnect then it is safe to say that 100 nodes each writing a 40MB file to disk and then the central node reading 100 40MB files from disk is the slowest way to go. Unless your central node cannot allocate 4GB RAM without swapping, in which case things probably get complicated.

If your shared disk is high performance it may be the case that it is faster for 100 nodes to each write a 40MB file, and for the central node to read 100 40MB files.

Eric M
  • thank you for this detailed response. the truth is, I still don't have the system built; I need to decide on the specs; I only described the problem in order to get a sense of how to build the system so that it can handle my problem; Do you have any specific recommendation what to do / not do when building the system? – user3262424 May 19 '11 at 22:07
  • 1
    Well, you ask a many-sided question, and not one I can answer quickly. Send me an email if you want to emerth@hotmail.com subject line 'Cluster Question' and we can kick it around a bit if you want via email. Whatever we come up with you can post back here if you want, it's OK with me. – Eric M May 19 '11 at 22:28
  • 1
    Then again, taking this out of SO is not really fair because other people have contributed and would probably contribute more. So disregard my comment about email conversation entirely because I was wrong to make such a suggestion. I'll ask these questions to start with: (1) does your master process need the entire 4GB of result data before it can start working on the results; (2) how small is a unit of result data; (3) are you willing to spend perhaps 5x the money on network adapters and switches (IB costs more than Ethernet) or would this be hard for you to justify. – Eric M May 19 '11 at 22:39
3

Use rpyc. It's mature and actively maintained.

Here's their blurb about what it does:

RPyC (IPA:/ɑɹ paɪ siː/, pronounced like are-pie-see), or Remote Python Call, is a transparent and symmetrical python library for remote procedure calls, clustering and distributed-computing. RPyC makes use of object-proxying, a technique that employs python's dynamic nature, to overcome the physical boundaries between processes and computers, so that remote objects can be manipulated as if they were local.

David Mertz has a quick introduction to RPyC at IBM developerWorks.
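
As a rough illustration (not taken from the RPyC docs), here is a minimal sketch of the pattern: the service name, port and hostnames are hypothetical, and each server is assumed to keep its ~40 MB result in RAM after processing.

```python
# On each of the 100 servers -- a sketch, not a drop-in solution.
import rpyc
from rpyc.utils.server import ThreadedServer

RESULT = {"output": b""}  # filled with the ~40 MB processed output, kept in RAM

class ResultService(rpyc.Service):
    def exposed_get_result(self):
        # Hand back the whole result as one big chunk, not line by line.
        return RESULT["output"]

if __name__ == "__main__":
    ThreadedServer(ResultService, port=18861).start()
```

```python
# On the central machine: pull each node's result straight out of its RAM.
import rpyc

HOSTS = ["node%03d" % i for i in range(100)]  # hypothetical hostnames

chunks = []
for host in HOSTS:
    conn = rpyc.connect(host, 18861)
    chunks.append(bytes(conn.root.get_result()))  # copy locally before closing
    conn.close()

combined = b"".join(chunks)  # ~4 GB now in the central node's RAM; aggregate from here
```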

Steven Rumbalski
  • @Steven Rumbalski: thank you. How long do you think it should take the central machine to read the 4[GB] data? – user3262424 May 19 '11 at 17:51
  • @user3262424 You won't know until you try it and time it. Make a function on the remote machine that yields the data (probably in large chunks, perhaps in one big chunk, definitely not line by line). Then call that function from the local machine. It should be relatively easy to set that up. – Steven Rumbalski May 19 '11 at 17:56
  • @Steven Rumbalski: is there a way to send the data from the memory on the `remote machine` straight to the memory of the `central machine`, with no read/write to disk? – user3262424 May 19 '11 at 18:17
  • 1
    RPC methods should not hit the disk, nor should MPI methods (unless you run out of RAM and start swapping hard). Also, if you are using Infiniband for your node interconnect, you should be able to make it do remote DMA, which is faster than traversing entire network stack. – Eric M May 19 '11 at 18:31
  • @user3262424 Yes. Assuming the remote procedure does not write to disk, no data will be written to disk. – Steven Rumbalski May 19 '11 at 18:34
  • @user3262424 - If you do an RPC and retrieve data from a variable(s) in RAM and store it in variable(s) in RAM then the data never hits the disk. RPC is just a remoted procedure call - it deals with data stored in RAM. If you retrieve data amounting to some large fraction of the RAM in the node at the receiving end, then you may run out of RAM and the OS will be swapping. But you are moving only 4GB total - if the central node has 8GB RAM or more you should be fine. – Eric M May 19 '11 at 19:16
  • @user3262424 - Also if you are really really serious about avoiding ever hitting the disk with that 4GB of product data, look at the man page for mlock() (I'm assuming you are Linux). See http://stackoverflow.com/questions/3245406/allocating-largest-buffer-without-using-swap – Eric M May 19 '11 at 19:19
  • @Eric M: suppose my process (in `Python`) is running on the `remote machine`. How do I save the results of the processing in RAM, for later retrieval? and, when accessing the `remote machine` from the `central machine`, how do I refer to this (stored) value? – user3262424 May 19 '11 at 19:58
  • @Steven Rumbalski: feel free to answer the same comment above (to Eric M). SO does not let me edit the comment, hence I am adding a new one. – user3262424 May 19 '11 at 20:05
  • @user3262424 - allocate a list or dictionary or... sufficient to hold your 40MB of data in the program that runs on the compute nodes. Store your results in that list or dictionary. The list or dictionary is in RAM. Use RPC or MPI methods to have the central node invoke the calculation on the compute nodes. When the computation on each node is done... RPC call on central node returns the data... or MPI methods inform your central node program it is time to recover the result. – Eric M May 19 '11 at 21:13
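
Pulling the comment thread above together: a minimal sketch where the central node fetches all 100 RAM-resident results in parallel with a thread pool, reusing the hypothetical ResultService and port from the sketch above.

```python
# Central node: fetch all 100 in-RAM results concurrently over RPC.
from concurrent.futures import ThreadPoolExecutor
import rpyc

HOSTS = ["node%03d" % i for i in range(100)]  # hypothetical hostnames

def fetch(host):
    conn = rpyc.connect(host, 18861)
    try:
        return bytes(conn.root.get_result())  # one ~40 MB chunk, straight from RAM
    finally:
        conn.close()

with ThreadPoolExecutor(max_workers=20) as pool:
    chunks = list(pool.map(fetch, HOSTS))  # ~4 GB total, never touching disk
```

The GigE link is still the bottleneck, so parallel fetches mainly hide per-connection latency rather than raise the aggregate transfer rate.
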
2

What's your networking setup? If your central machine is connected to the cluster by a single gigabit link, it's going to take you at least ~30 s to copy the 4 GB to it (and that's assuming 100% efficiency and about 8 s per gigabyte, which I've never seen).

timday
2

Experiment! Other answers have included tips on what to experiment with, but you might first solve the problem in the most straightforward way and use that as your baseline.

You have 1 MB producing 40 MB of output on each server - experiment with each server compressing the data before it is sent. (That compression might be free-ish if compression is part of your file system.)
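
A minimal sketch of that idea (whether it pays off depends entirely on how compressible your 40 MB of output really is):

```python
# Each server compresses its output before sending; level 1 trades ratio for speed.
import zlib

def pack(output_bytes: bytes) -> bytes:
    return zlib.compress(output_bytes, 1)

def unpack(payload: bytes) -> bytes:
    return zlib.decompress(payload)
```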

Latency - it is never zero.

Can you change your algorithms?

Can you do some sort of hierarchical merging of outputs rather than one CPU handling all 4 GB at once? (Decimation in time.)
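
For example, a minimal in-process sketch of such a pairwise (tree) merge; the same pattern could be spread across intermediate aggregator nodes, and merge_two stands in for whatever your aggregation step actually is:

```python
# Tree reduction: merge results in ~log2(N) rounds instead of one pass over all 4 GB.
def tree_merge(chunks, merge_two):
    while len(chunks) > 1:
        merged = [merge_two(chunks[i], chunks[i + 1])
                  for i in range(0, len(chunks) - 1, 2)]
        if len(chunks) % 2:  # odd chunk out rides along to the next round
            merged.append(chunks[-1])
        chunks = merged
    return chunks[0]
```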

It is possible to buy quad-socket servers with 80 cores - would that be quicker, since storage could be local and you could configure the one machine with a lot of RAM?

Paddy3118
1

Can you write your code using a Python binding to MPI? MPI has facilities for over-the-wire data transmission from M nodes to N nodes, M, N >= 1.

Also, as mentioned above you could write the data to 100 files on a shared filesystem, then read the files on the 'master' node.
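
A minimal sketch of the MPI route using mpi4py (assuming mpi4py is an acceptable binding and that every rank's result fits comfortably in RAM; compute_my_40mb_chunk is a hypothetical stand-in for your per-node processing):

```python
# Run with something like: mpiexec -n 101 python gather_results.py
# Rank 0 acts as the central node; ranks 1..100 are the compute nodes.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

result = None if rank == 0 else compute_my_40mb_chunk()  # hypothetical per-node work
gathered = comm.gather(result, root=0)                   # every chunk lands on rank 0

if rank == 0:
    chunks = [c for c in gathered if c is not None]      # drop rank 0's placeholder
    # aggregate the ~4 GB of chunks here
```

For large numeric arrays, the buffer-based comm.Gather (uppercase) avoids the pickling overhead of the lowercase comm.gather.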

Eric M
  • 1
    If possible, I would avoid writing the 4GB files to disk as you now have to contend with network time + time to write to disk + time to read from disk. – Steven Rumbalski May 19 '11 at 18:04
  • 1
    Good point. However we do not know what kind of node interconnect and disk system he is using in his cluster, so there are a few free variables in any analysis of his problem. – Eric M May 19 '11 at 18:28
  • currently I am using 1G Ethernet. If there is a reason to upgrade, I might consider it. – user3262424 May 19 '11 at 20:02
  • @user3262424 - See my comment/reply below, which will be there as soon as I finish typing it. – Eric M May 19 '11 at 20:20
  • 1
    Upgrade from GigE... do some reading on cluster programming models and systems. 10GigE is ten times faster but 10Gig Infiniband has lower latency. 20Gig and 40 Gig Infiniband can be had. But raw speed and latency can be traded off against each other, and it depends on how you structure your program. Say each of the 100 nodes produces 1024^2 40 byte results and the master node can process them quickly in any order - then MPI and low latency interconnect could be the ticket. If the master node needs all 4GB before it can do anything then raw speed probably matters more. – Eric M May 19 '11 at 21:01