
I'm trying to write a simple function that reads a series of files, performs some regex search (or just a word count) on them, and then returns the number of matches. I'm trying to make this run in parallel to speed it up, but so far I have been unable to achieve this.

If I do a simple loop with a math operation, I do get significant performance increases from parallelization. However, a similar approach for the grep-like function doesn't provide any speedup:

function open_count(file)
    fh = open(file)
    text = readall(fh)      # read the whole file into one string
    length(split(text))     # word count: number of whitespace-separated tokens
end
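
The regex version I have in mind is essentially the same, just counting matches instead of whitespace-separated words (the pattern here is only an example):

function open_count_regex(file, pattern)
    fh = open(file)
    text = readall(fh)
    n = 0
    for m in eachmatch(pattern, text)   # count regex matches in the whole file
        n += 1
    end
    n
end

# e.g. open_count_regex(string(dir,"/",name), r"\bsome_word\b")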



tic()
total = 0
for name in files
    total += open_count(string(dir,"/",name))
end
toc()
elapsed time: 29.474181026 seconds


tic()
total = @parallel (+) for name in files
    open_count(string(dir,"/",name))
end
toc()

elapsed time: 29.086511895 seconds

I tried different versions but also got no significant speed increases. Am I doing something wrong?

  • 27 seconds to process a file? I'd guess these are fairly big disk files, they won't fit in your processor's disk cache, and have to be read from the disk each time. Then the best you can hope for is time equal to the time to read both files from the disk. Typically the disk can only read one place at a time --> disk reads are sequential and thus no speedup. – Ira Baxter Jan 23 '14 at 08:01
  • It's not one single file, it's a list of files (almost a GB in total I think). I should have said that. But thanks for that explanation. – Matías Guzmán Naranjo Jan 23 '14 at 08:12
  • I can not test this, because I do not have files of this size to test on. Could you publish a script that generates something with the same structure and size? Your OS is probably taking up most of the time here. Have you considered closing the files in open_count()? – ivarne Jan 23 '14 at 09:25
  • Have you profiled? Doing so will tell you whether the bottleneck is in the I/O or the regex. If it's the former, consider spreading your files across multiple drives. – tholy Jan 23 '14 at 11:25
  • @ivarne closing the files did slightly improve performance, that will be helpful. With this [script](http://pastebin.com/icTLiNZS) you can get a similar looking corpus. – Matías Guzmán Naranjo Jan 23 '14 at 16:41
  • @tholy how do I profile the program? Also, I'm not really interested in optimizing this particular task, but writing fast general functions for working with corpora independent of how it is stored or where. – Matías Guzmán Naranjo Jan 23 '14 at 16:41
  • I know the idea is to do this in julia, but what about using [grep from within julia](http://docs.julialang.org/en/latest/manual/running-external-programs/)? [grep skips bytes](http://stackoverflow.com/a/12630617/178651) which might make it faster, if IO is the bottleneck and you don't want to mess with more drives, etc. And, if you frequently had to grep the same files, maybe could copy to an [in-memory partition](http://www.commandlinefu.com/commands/view/224/mount-a-temporary-ram-partition) before processing them. – Gary S. Weaver Jan 23 '14 at 18:50
  • Related to the in-memory partition idea, note that [the OS may already be caching](http://unix.stackexchange.com/a/40207/23218), so may not be needed. Since that is all about calling out to grep though, it might not be what you are looking for anyway. – Gary S. Weaver Jan 23 '14 at 19:10
  • @MatíasGuzmánNaranjo, see the Profiler documentation in the standard library: http://docs.julialang.org/en/release-0.2/stdlib/profile. – tholy Jan 26 '14 at 11:52

1 Answer


I've had similar problems with R and Python. As others pointed out in the comments, you should start with the profiler.
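
For example, something along these lines (a minimal sketch using the built-in profiler from the docs linked in the comments) should show whether the time goes to the I/O or to the splitting/regex:

@profile for name in files
    open_count(string(dir,"/",name))
end
Profile.print()   # report of where the time is spent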

If the read is taking up the majority of the time then there's not much you can do. You can try moving the files to different hard drives and reading them in from there. You can also try a RAMDisk kind of solution, which basically makes part of your RAM look like permanent storage (reducing available RAM), but then you get very fast reads and writes.

However, if the time is spent doing the regex, then consider the following: create a function that reads in one file as a whole and splits it into separate lines. That should be a single contiguous read and hence as fast as possible. Then create a parallel version of your regex that processes the lines in parallel. This way the whole file is in memory and your computing cores can munge the data at a faster rate, so you might see some increase in performance.
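
A rough sketch of that idea (the function name and pattern argument are placeholders, and this assumes worker processes have already been started with julia -p N or addprocs), using the same @parallel reduction as in the question:

function count_matches(file, pattern)
    fh = open(file)
    lines = readlines(fh)   # one contiguous read of the whole file
    close(fh)
    # distribute the per-line regex work across the workers and sum the counts
    @parallel (+) for i = 1:length(lines)
        n = 0
        for m in eachmatch(pattern, lines[i])
            n += 1
        end
        n
    end
end

Note that @parallel copies the captured lines array to every worker, so for very large files splitting the lines into chunks and using pmap might scale better.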

This is a technique I used when trying to process large text files.

– niczky12