
The task I have at hand is to read the lines of a large file, process them, and return ordered results.

My algorithm is:

  1. start with a master process that will evaluate the workload (written in the first line of the file)
  2. spawn worker processes: each worker will read a part of the file using pread/3, process that part, and send the results to the master
  3. the master receives all sub-results, sorts them, and returns them, so basically no communication is needed between the workers (a rough sketch follows)
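
Roughly, the shape I have in mind (a sketch only; the chunk computation here ignores line boundaries, which is part of what I am asking about):

    -module(par_read).
    -export([run/2]).

    %% Split Path into N byte ranges and give one range to each worker.
    %% (The workload header in the first line is ignored in this sketch.)
    run(Path, N) ->
        Size = filelib:file_size(Path),
        Len = Size div N,
        Parent = self(),
        Pids = [begin
                    Off = I * Len,
                    %% the last worker takes whatever remains of the file
                    ChunkLen = if I =:= N - 1 -> Size - Off; true -> Len end,
                    spawn_link(fun() -> worker(Parent, Path, Off, ChunkLen) end)
                end || I <- lists:seq(0, N - 1)],
        Parts = [receive {Pid, Res} -> Res end || Pid <- Pids],
        lists:sort(lists:append(Parts)).

    worker(Parent, Path, Off, Len) ->
        %% raw-mode file descriptors are private to the opening process,
        %% so each worker opens the file for itself
        {ok, Fd} = file:open(Path, [read, raw, binary]),
        {ok, Bin} = file:pread(Fd, Off, Len),
        ok = file:close(Fd),
        Parent ! {self(), binary:split(Bin, <<"\n">>, [global])}.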

My questions:

  1. How do I find the optimal balance between the number of Erlang processes and the number of cores? If I spawn one process per processor core, would that underutilize my CPU?
  2. How does pread/3 reach the specified position: does it iterate over all lines in the file? And is pread/3 a good plan for parallel file reading?
  3. Is it better to send one big message from process A to B, or to send N small messages? I have found part of the answer in the link below, but I would appreciate further elaboration:
    erlang message passing architecture
  • Is the file memory mapped? If not, then having multiple actors randomly access different parts of the file may [slow you down](http://stackoverflow.com/questions/17220892/read-the-30million-user-ids-one-by-one-from-the-big-file/17220973#17220973), since only one actor can read the disk at any given moment, and if you've got a magnetic disk then each actor will trigger a disk seek. I recommend having the master map all (or most, if the file is large) of the file into memory, then having the workers operate on this memory-mapped (partial) file. – Zim-Zam O'Pootertoot Jun 10 '15 at 13:49
  • @Zim-ZamO'Pootertoot is right. Generally the way I approach this is to have the "master" read the file line by line (or segment by segment), then distribute the lines to a number of "workers" for processing. This can be very fast if you are reading the file as binary (as opposed to list) strings, as binaries greater than 64 bytes in size are reference-counted and have a low copy overhead when being sent in a message to another process. – Soup d'Campbells Jun 10 '15 at 15:49
  • In principle I agree. However, there was a big buzz in 2007 over slow sequential I/O in Erlang, and a [line_server module](https://github.com/dcaoyuan/snippet/blob/master/widefinder/src/line_server_modified.erl) which reads the file concurrently. I am wondering now, after 8 years, what the best way to do it is. – ErlangNewbie Jun 11 '15 at 09:14

1 Answer

  1. Erlang processes are cheap. You're free (and encouraged) to use more than however many cores you have. There might be an upper limit to what is practical for your problem (loading 1 TB of data with one process per line is asking a bit much, depending on line size).

    The easiest way to do it when you don't know is to let the user decide. This means you could spawn N workers, distribute work between them, and wait to hear back. Re-run the program with a different N if you don't like how it runs (a sketch follows at the end of this item).

    A trickier way to do it is to benchmark a bunch of times, pick what you think makes sense as a maximal value, stick it in a pool library (if you want to; some pools go for preallocated resources, some for a resizable amount), and settle for a one-size-fits-all solution.

    But really, there is no easy 'optimal number of cores'. You can run it on 50 processes as well as on 65,000 of them if you want; if the task is embarrassingly parallel, the VM should be able to make use of most of them and saturate the cores anyway.
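
    For instance, a tiny hand-rolled version of that pattern, just to show the shape (pmap_n/3 and chunk/2 are made-up names, not library functions):

        %% Run F over Items with at most N worker processes, keeping order.
        pmap_n(F, Items, N) ->
            Parent = self(),
            BatchLen = max(1, (length(Items) + N - 1) div N),
            Pids = [spawn_link(fun() ->
                        Parent ! {self(), lists:map(F, Batch)}
                    end) || Batch <- chunk(Items, BatchLen)],
            %% receiving per Pid, in spawn order, keeps results ordered
            lists:append([receive {Pid, Res} -> Res end || Pid <- Pids]).

        chunk(Items, Len) when length(Items) =< Len -> [Items];
        chunk(Items, Len) ->
            {Batch, Rest} = lists:split(Len, Items),
            [Batch | chunk(Rest, Len)].

    Re-running with a different N is then just a matter of changing one argument.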


  2. Parallel file reading is an interesting question. It may or may not be faster (as the comments above have mentioned), and it may only represent a speed-up if the work on each line is small enough that reading the file is the biggest cost.

    The tricky bit is really that functions like pread/2-3 take a byte offset, while your question is worded as if you care about lines of the file. The byte ranges you hand off to workers may therefore end up straddling a line. If your block boundary lands at the word my in this is my line\nhere it goes\n, one worker will see an incomplete trailing line, while the other will report only my line\n, missing the prior this is. (One common workaround is sketched at the end of this item.)

    Generally, this kind of annoying stuff is what will lead you to have the first process own the file and sift through it, only handing off bits of text for the workers to process; that process then acts as a sort of coordinator.

    The nice aspect of this strategy is that if the main process knows everything that was sent as a message, it also knows when all responses have been received, making it easy to know when to return the results. If everything is disjoint, you instead have to trust both the starter and the workers to tell you "we're all out of work" as a distinct set of independent messages.

    In practice, you'll probably find that what helps the most is doing file operations in a way that is easy on your hardware, more than worrying about how many processes can read the file at once. There's only one hard disk (or SSD), and all the data has to go through it anyway; parallelism may well end up limited by access to it.
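
    To make the straddling fix concrete, here is one common workaround, sketched with made-up names (lines_for_chunk/3, collect/3): each worker owns exactly the lines that start inside its byte range, so no line is processed twice or dropped:

        %% Return the lines whose first byte falls inside [Off, Off + Len).
        lines_for_chunk(Path, Off, Len) ->
            {ok, Fd} = file:open(Path, [read, raw, binary, {read_ahead, 64 * 1024}]),
            case Off of
                0 -> ok;
                _ ->
                    %% back up one byte and discard the line straddling Off;
                    %% the previous chunk's worker completes that line
                    {ok, _} = file:position(Fd, Off - 1),
                    _ = file:read_line(Fd)
            end,
            Lines = collect(Fd, Off + Len, []),
            ok = file:close(Fd),
            Lines.

        collect(Fd, End, Acc) ->
            {ok, Pos} = file:position(Fd, cur),
            if
                Pos >= End ->
                    lists:reverse(Acc);
                true ->
                    case file:read_line(Fd) of
                        {ok, Line} -> collect(Fd, End, [Line | Acc]);
                        eof -> lists:reverse(Acc)
                    end
            end.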


  3. Use messages that make sense for your program. The most performant program would have a lot of processes able to do work without ever needing to pass messages, communicate, or acquire locks.

    A more realistic very performant program would use very few messages of a very small size.

    The fun thing here is that your problem is inherently data-based, so there are a few things you can do:

    • Make sure you read text in a binary format; large binaries (> 64 bytes) get allocated on a global binary heap, are shared around, and are GC'd with reference counting.
    • Hand in information on what needs to be done rather than the data for doing it; this one would need measuring, but the lead process could go over the file, note where lines end, and just hand byte offsets to the workers so they can go and read the file themselves. Do note that you'll end up reading the file twice, so if memory allocation is not your main overhead, this will likely be slower.
    • Make sure the file is read in raw or ram mode; other modes use a middle-man process to read and forward data (this is useful when reading files over the network in clustered Erlang nodes); raw and ram modes give the file descriptor directly to the calling process and are a lot faster.
    • First worry about writing a clear, readable, and correct program. Only if it is too slow should you attempt to refactor and optimize it; you may very well find it good enough on the first try.
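
    A small sketch tying the first and third points together (send_lines/2 and stream/2 are made-up helpers, not library functions):

        %% Stream each line of Path to Worker as a binary message.
        send_lines(Path, Worker) ->
            {ok, Fd} = file:open(Path, [read, raw, binary, {read_ahead, 64 * 1024}]),
            stream(Fd, Worker),
            ok = file:close(Fd).

        stream(Fd, Worker) ->
            case file:read_line(Fd) of
                {ok, Line} ->
                    %% lines over 64 bytes live on the shared binary heap,
                    %% so this send copies a reference, not the data
                    Worker ! {line, Line},
                    stream(Fd, Worker);
                eof ->
                    Worker ! done
            end.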

I hope this helps.

P.S. You can try the really simple stuff at first:

  1. either:

    • read the whole file at once with {ok, Bin} = file:read_file(Path) and split the lines (with binary:split(Bin, <<"\n">>, [global])),
    • use {ok, Io} = file:open(File, [read, ram]) and then use file:read_line(Io) on the file descriptor repeatedly, or
    • use {ok, Io} = file:open(File, [read, raw, {read_ahead, BlockSize}]) and then use file:read_line(Io) on the file descriptor repeatedly
  2. call rpc:pmap({?MODULE, Function}, ExtraArgs, Lines) to run everything in parallel automatically (it will spawn one process per line)

  3. call lists:sort/1 on the result.
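
Put together, that simple version could look like this (the module name and process_line/1 are placeholders for your actual work):

    -module(simple).
    -export([run/1, process_line/1]).

    run(Path) ->
        {ok, Bin} = file:read_file(Path),
        Lines = binary:split(Bin, <<"\n">>, [global]),
        %% rpc:pmap spawns one process per line; fine as a first try
        Results = rpc:pmap({?MODULE, process_line}, [], Lines),
        lists:sort(Results).

    %% stand-in for the real per-line computation
    process_line(Line) ->
        byte_size(Line).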

Then, from there, you can refine each step if you identify it as problematic.

I GIVE TERRIBLE ADVICE
  • Thanks, that was great. After a few tests I read the file all at once as a binary and split the work using binary:part/2, since the lines are of fixed size. I also allocated a pool of processes and it scales really nicely, but it's still a bit slow. I'll try to optimize using your notes; after I finish I'll publish the code and speedup graphs here. – ErlangNewbie Jun 17 '15 at 14:40