
Summary

Benchmarking times for Channels in Julia - using a ~5GB tsv file

  • Baseline: shell tools (cat, grep - written in C)
    • ~ 2 seconds
  • Julia: Simple loop with eachline
    • ~ 4-5 seconds (2nd run, so excluding compilation)
  • Julia Channel implementation
    • ~ 11 seconds (2nd run, so excluding compilation)

Also:

  • Pure Python
    • ~ 4-5 seconds

Longer Explanation

I have been working toward the most performant, standard multiprocessing design pattern, wherein data is streamed from disk or a download, pieces are fed to all cores on the system, and the output is serialized back to disk. This is a hugely important design to get right, since a great many programming tasks fit this description.

Julia seems like a great choice for this due to its supposed ability to be performant.

In order to serialize the IO to/from disk (or a download) and then send data to each processor, Channels seem to be the approach Julia suggests.

However, my tests so far seem to indicate that this is extremely non-performant.

The simplest example displays just how exceedingly slow Channels (and Julia!) are at this. It's been very disappointing.

A simple example mimicking cat and grep (with the multiprocessing bits removed for clarity):

Julia code:

using CodecZlib: GzipDecompressorStream
using TranscodingStreams: NoopStream

    
"""
  A simple function to "generate" (place into a Channel) lines from a file
  - This mimics the behavior of Python's 'yield'
"""
function cat_ch(fpath)
  Channel() do ch
    codec = endswith(fpath, ".gz") ? GzipDecompressorStream : NoopStream
    open(codec, fpath, "r") do stream
      for (i, l) in enumerate(eachline(stream))
        put!(ch, (i, l))
      end
    end
  end
end


function grep_ch(line_chnl, searchstr)
  Channel() do ch
    for (i, l) in line_chnl
      if occursin(searchstr, l)
          put!(ch, (i, l))
      end
    end
  end
end

function catgrep_ch(fpath, search)
  for (i, l) in grep_ch(cat_ch(fpath), search)
      println((i, l))
  end
end

function catgrep(fpath, search)
  codec = endswith(fpath, ".gz") ? GzipDecompressorStream : NoopStream
  open(codec, fpath, "r") do stream
    for (i, l) in enumerate(eachline(stream))
      if occursin(search, l)
        println((i,l))
      end
    end
  end
end

if abspath(PROGRAM_FILE) == @__FILE__
  fpath = ARGS[1]
  search = ARGS[2]
  catgrep_ch(fpath, search)
end

Performance Benchmarks

1) Baseline:

user@computer>> time (cat bigfile.tsv | grep seachterm)

real    0m1.952s
user    0m0.205s
sys 0m2.525s

2) Without Channels (simple loop) in Julia:

julia> include("test1.jl")

julia> @time catgrep("bigfile.tsv", "seachterm")
  4.448542 seconds (20.30 M allocations: 10.940 GiB, 5.00% gc time)

julia> @time catgrep("bigfile.tsv", "seachterm")
  4.512661 seconds (20.30 M allocations: 10.940 GiB, 4.87% gc time)

So it's roughly 2-3x worse in the simplest possible case. Nothing fancy is done here at all, and it isn't due to compilation.

3) Channels in Julia:

julia> @time catgrep_ch("bigfile.tsv", "seachterm")
 11.691557 seconds (65.45 M allocations: 12.140 GiB, 3.06% gc time, 0.80% compilation time)

julia> @time catgrep_ch("bigfile.tsv", "seachterm")
 11.403931 seconds (65.30 M allocations: 12.132 GiB, 3.03% gc time)

This is really horrible, and I'm not sure how it becomes so sluggish.

Is the way in which Channels are used here wrong?

chase

3 Answers


Julia, grep and Python use different string-searching algorithms. There are many such algorithms, and some are far better than others in specific cases.

grep is highly optimized to run quickly in many situations, including your specific use-case. Indeed, according to the GNU documentation, the Boyer-Moore fast string searching algorithm is used to match a single fixed pattern, and the Aho-Corasick algorithm is used to match multiple fixed patterns. In your specific use-case Boyer-Moore is selected, and it is generally fast since it can skip parts of the input based on the searched string. Its best-case complexity is Ω(n/m) and its worst-case complexity is O(mn). It is extremely fast if the text rarely contains characters of the searched string. For example, searching for seachterm in this is a test with a pretty long sentence (repeated 58.5 million times) is 10 times faster than searching for iss, even though neither is present in the target file. This is because Boyer-Moore looks for the last letter of the searched string (an m) in the text and, since it cannot find it, it can skip far ahead. There are other reasons why grep is so fast compared to most alternative methods. One of them is that grep does not create/allocate sub-strings for each line, and instead works on a huge raw buffer. Note that cat bigfile.tsv | grep seachterm can be significantly slower than grep seachterm bigfile.tsv, since the pipe introduces a significant overhead when the parsing is fast enough.
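To make the skipping idea concrete, here is a minimal Boyer-Moore-Horspool sketch in Julia operating on raw bytes. It implements only the "bad character" rule (it is not what grep actually ships, and the function name is just for illustration):

function horspool_find(haystack::Vector{UInt8}, needle::Vector{UInt8})
    n, m = length(haystack), length(needle)
    m == 0 && return 1
    # Shift table: how far the search window may jump, keyed by the byte under its last position.
    shift = fill(m, 256)
    for j in 1:m-1
        shift[needle[j] + 1] = m - j
    end
    i = 1
    while i <= n - m + 1
        # Compare the window right-to-left against the needle.
        j = m
        while j >= 1 && haystack[i + j - 1] == needle[j]
            j -= 1
        end
        j == 0 && return i                     # match starting at index i
        i += shift[haystack[i + m - 1] + 1]    # this skip is what makes the best case Ω(n/m)
    end
    return 0                                   # no match
end

# When the text contains no 'm' at all, the window jumps length(needle) bytes at a time:
horspool_find(Vector{UInt8}("this is a test "^1000), Vector{UInt8}("seachterm"))  # returns 0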

CPython uses a mix of different algorithms so as to be efficient in most cases. Based on the implementation, it uses a variant of the Boyer-Moore algorithm "incorporating ideas of Horspool and Sunday". They claim the resulting algorithm is faster than others such as Knuth-Morris-Pratt. For long strings, it uses an even faster, very efficient algorithm: Crochemore and Perrin's Two-Way algorithm (a mix of BM and KMP). This one runs in O(n+m) in the worst case, which is optimal. Note that while this implementation is great, splitting the file into lines and creating many string objects can significantly decrease performance. This is certainly why your Python implementation is not so fast compared to grep.

In the Julia code, the file is split into lines, which introduces a significant overhead and puts pressure on the garbage collector. Furthermore, occursin does not seem particularly optimized. There is no comment in the code about which algorithm is used; that being said, it looks like a naive, generic brute-force algorithm running in O(mn) time. Such code cannot compete with optimized implementations of efficient algorithms like the ones in Python and grep.

Channels are a bit similar to coroutines and fibers (or any "light threads") with a FIFO queue to manage messages. Such a construct introduces a significant overhead due to expensive software-defined context switches (i.e. yields, which mainly consist of saving/restoring some registers). The negative effect on performance can also be delayed: light-threading systems have their own stack and their own code context, so when the processor does a light-thread context switch, this can cause data/code cache misses. For more information about how channels work, you can read the documentation (which mentions an embedded task scheduler) or read the code directly.
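A toy way to see this per-item cost in isolation (not the question's workload, just an illustration): summing integers directly versus pulling the same integers through an unbuffered Channel.

function sum_plain(n)
    s = 0
    for i in 1:n
        s += i
    end
    return s
end

function sum_channel(n)
    # Unbuffered channel: every put!/take! pair forces a task switch.
    ch = Channel{Int}() do c
        for i in 1:n
            put!(c, i)
        end
    end
    s = 0
    for i in ch
        s += i
    end
    return s
end

# Run each twice so the timings exclude compilation, as in the question's benchmarks:
@time sum_plain(1_000_000)
@time sum_channel(1_000_000)   # expect this to be dramatically slower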

In addition, channels create objects/messages that need to be managed by the garbage collector, putting even more pressure on it. In fact, the number of allocations is >3 times bigger in the channel-based version. One can argue that the reported GC overhead is low, but such metrics often underestimate the overall overhead, which includes allocations, memory diffusion/fragmentation, GC collections, cache effects, etc. (and, in this case, even I/O overlapping effects).

I think the main problem with the channel-based implementation is that the channels in your code are unbuffered (see the documentation about this). Using wide buffers can help significantly reduce the number of context switches and thus the overhead, as sketched below. This may increase latency, but there is often a trade-off to make between latency and throughput (especially in scheduling). Alternatively, note that there are some packages that can be faster than built-in channels.
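For example, giving the question's generator a buffer is a one-argument change (a sketch; 10_000 is an arbitrary size to experiment with):

function cat_ch(fpath)
  # Buffered: the producer task can run up to 10_000 lines ahead before blocking.
  Channel(10_000) do ch
    codec = endswith(fpath, ".gz") ? GzipDecompressorStream : NoopStream
    open(codec, fpath, "r") do stream
      for (i, l) in enumerate(eachline(stream))
        put!(ch, (i, l))
      end
    end
  end
end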

Jérôme Richard

Buffering channels really speeds up this code; however, the main problem remains: one should not use channels like this. The ratio of processing time to transfer time is too small for this task to be reasonable, so most of the time is spent transferring data over the channel and allocating/freeing memory.

Still, here are some improvements to the code:

  1. Add a buffer to the channel: Channel(f, buffer_size)
  2. Perform operations asynchronously: Channel(f, buffer_size; spawn=true)
  3. Provide type stability: Channel{Tuple{Int,String}}(f, buffer_size; spawn=true)

So, assuming you are reading long lines from a file and performing several heavy processing steps on them that could run concurrently, you could implement it like this:

function step1(ch_out)
    open(file, "r") do stream   # `file` is assumed to be defined in the enclosing scope
        for l in eachline(stream)
            res = process1(l)
            put!(ch_out, res)
        end
    end
end

function step2(ch_in, ch_out)
    for el in ch_in
        res = process2(el)
        put!(ch_out, res)
    end
end

...

function main(args)
    ch1 = Channel{TYPE1}(step1, bufsize1; spawn=true)
    # The Channel constructor only passes the newly created channel to the function,
    # so stages that also consume an upstream channel need a closure:
    ch2 = Channel{TYPE2}(ch -> step2(ch1, ch), bufsize2; spawn=true)
    ...
    ch_end = Channel{TYPE_END}(ch -> step_end(ch_prev, ch), bufsize_end; spawn=true)  # ch_prev: last channel of the elided stages
    for el in ch_end
        process_end(el)
    end # or simply collect(ch_end)
end
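Applied to the question's cat/grep pipeline, the three changes above might look like this (a sketch; the 10_000 buffer sizes are arbitrary and the gzip handling is omitted for brevity):

function catgrep_ch_buffered(fpath, search)
    # Stage 1: read lines into a buffered, typed channel; spawn=true lets the task run on another thread.
    lines = Channel{Tuple{Int,String}}(10_000; spawn=true) do ch
        open(fpath, "r") do stream
            for (i, l) in enumerate(eachline(stream))
                put!(ch, (i, l))
            end
        end
    end
    # Stage 2: filter matching lines on another task.
    hits = Channel{Tuple{Int,String}}(10_000; spawn=true) do ch
        for (i, l) in lines
            occursin(search, l) && put!(ch, (i, l))
        end
    end
    for (i, l) in hits
        println((i, l))
    end
end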

You can gain even more control over how tasks are spawned, scheduled and so on with Distributed.

Finally, you could profile this code to track how much of each channel's buffer is actually used, and tune the buffer sizes to obtain a non-stop conveyor.

Paul Dydyshko

Edit (with regard to new info from @chase)

@chase, as far as I understand, you are comparing the performance of yield in Python, which is a generator for non-materialized lists, with a Channel in Julia, which is a FIFO queue that supports multi-threaded insertion and polling of elements. In that case you are comparing two very different things (apples to oranges).
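For reference, the closer Julia analogue of a Python generator is a plain lazy iterator, which gives the same on-demand behavior without any task or queue machinery. A minimal sketch of the question's cat/grep in that style (gzip handling omitted):

function catgrep_iter(fpath, search)
    open(fpath, "r") do stream
        lines = enumerate(eachline(stream))                       # lazy, like enumerate(f) in Python
        hits  = Iterators.filter(((i, l),) -> occursin(search, l), lines)
        for (i, l) in hits
            println((i, l))
        end
    end
end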

If your goal is an implementation similar in spirit to grep, have a look at the performance tips below.

Performance tips

A Channel will add a big overhead, like any additional communication layer. If you need performance, you need to:

  1. Use either @distributed or Threads.@threads to create parallel workers
  2. Have each worker open the file for reading
  3. Use seek to position each worker in the file (e.g. with a 1000-byte file and 2 workers, the first one starts at byte 0 and the second does seek(500))
  4. Remember to implement a mechanism that handles the case where a worker lands in the middle of a line
  5. Operate directly on raw bytes rather than String (for performance) - a rough sketch combining these tips follows the list
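A rough sketch of how tips 1-5 could fit together: split the file into byte ranges, have each worker seek to its range and realign to the next full line, then scan with threads. The chunking scheme is illustrative, and matching is still done on String rather than raw bytes for brevity:

function parallel_grep(fpath, search)
    fsize   = filesize(fpath)
    nchunks = Threads.nthreads()
    bounds  = round.(Int, range(0, fsize; length=nchunks + 1))
    hits    = [String[] for _ in 1:nchunks]          # one result bucket per worker

    Threads.@threads for k in 1:nchunks              # tip 1: parallel workers
        open(fpath, "r") do io                       # tip 2: each worker opens the file itself
            lo, hi = bounds[k], bounds[k + 1]
            seek(io, lo)                             # tip 3: jump to this worker's byte range
            lo > 0 && readline(io)                   # tip 4: skip the partial line at the boundary
            while !eof(io) && position(io) <= hi
                line = readline(io)
                occursin(search, line) && push!(hits[k], line)
            end
        end
    end
    return reduce(vcat, hits)
end

Start Julia with multiple threads (e.g. julia -t auto) for this to actually run in parallel; whether it helps then depends on whether the scan is CPU-bound or IO-bound, which is what the comments below discuss.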
Przemyslaw Szufel
  • These are good performance tips, but it's not an explanation of why the Julia version is slower than Python here. – Oscar Smith Jul 29 '22 at 05:13
  • Nothing is said about how the Python implementation is made. `grep` is only given as a reference, and this would be the way to implement `grep` in Julia. As far as I know Python does not even have a `Channel` class - there are `Queue` and `Pipe` in the `multiprocessing` module. Moreover, e.g., the Python implementation may read strings as ASCII as opposed to Julia reading Unicode here. Depending on the implementation, the performance bottleneck can be in quite a different place. – Przemyslaw Szufel Jul 29 '22 at 05:23
  • Some of these make some sense (although I would hope not to see *large* performance hits from using String rather than bytes); however, I don't understand (2). Assuming reading from a spinning disk, I believe that you should have *one* serialized IO thread - to avoid disk thrashing. I realize I could be wrong about this, and it likely will not hold for NVMe/SSD, but I recall learning that reading from (spinning-platter) disk should not be multi-processed (and preferably threading kept minimal to avoid disk thrash). Am I wrong about that? – chase Jul 29 '22 at 05:29
  • @PrzemyslawSzufel The Python implementation is very simple. It's simply `with open(fpath, "r") as f: for i, l in enumerate(f.readlines()): if search in l: print((i,l))` with extra fluff for args and such. I also checked a version with two generator functions yielding output, similar to the Julia Channels. Both were approximately the same in performance. – chase Jul 29 '22 at 05:35
  • @chase I edited the answer including your comment. Regarding disks - current drives are SSDs, so there is no cost of switching between sectors, and nowadays there are lots of buffering mechanisms. There is always the question of whether the bottleneck is CPU or IO. Regarding Strings - in Julia strings are UTF-8 encoded, which is a variable-length encoding, and in practical HPC scenarios you can get better performance with bytes. Look, for example, at what datatypes are used by CSV.jl for high performance. – Przemyslaw Szufel Jul 29 '22 at 08:23