Summary
Benchmarking times for Channels in Julia - using a ~5GB tsv file
- Baseline: Bash tools (cat, grep - baseline written in C)
- ~ 2 seconds
- Julia: Simple loop with eachline
- ~4-5 seconds (second run, so excluding compilation)
- Julia: Channel implementation
- ~11 seconds (second run, so excluding compilation)
Also:
- Pure Python
- ~ 4-5 seconds
Longer Explanation
I have been working toward the most common performance-critical multiprocessing design pattern: data is streamed from disk or from a download, pieces are fed to all cores on the system, and the output is serialized back to disk. This is an important design to get right, since a large share of data-processing tasks fit this description.
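For concreteness, here is a minimal sketch of the pattern I mean: one reader task streams items into a Channel, worker tasks consume and transform them, and results are collected for serialization. The `process_line` function and the buffer size of 1024 are placeholders, not part of my real code:

```julia
# Sketch: reader -> jobs channel -> N workers -> results channel -> collect.
function pipeline_sketch(lines, nworkers = Threads.nthreads())
    process_line(l) = uppercase(l)      # placeholder for the real per-item work

    jobs    = Channel{String}(1024)     # buffered, so the reader rarely blocks
    results = Channel{String}(1024)

    # Reader task: stream items in, then close the channel to end iteration.
    @async begin
        foreach(l -> put!(jobs, l), lines)
        close(jobs)
    end

    # Worker tasks: iterating a Channel takes items until it is closed and empty.
    workers = [Threads.@spawn begin
                   for l in jobs
                       put!(results, process_line(l))
                   end
               end for _ in 1:nworkers]

    # Close the results channel once every worker has finished.
    @async begin
        foreach(wait, workers)
        close(results)
    end

    collect(results)
end
```

Results arrive in whatever order the workers finish, so order must be restored afterwards if it matters (e.g. by carrying an index with each item, as my code below does).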
Julia seems like a great choice for this, given its reputation for performance.
To serialize the IO to/from disk (or a download) and then feed data to each worker, Channels appear to be the recommended mechanism in Julia.
However, my tests so far indicate that this approach performs very poorly.
Even the simplest example shows how slow Channels (and, here, Julia itself) are at this task, which has been disappointing.
A simple example mimicking cat and grep (with the multiprocessing bits removed for clarity):
Julia code:
using CodecZlib: GzipDecompressorStream
using TranscodingStreams: NoopStream

"""
A simple function to "generate" (place into a Channel) lines from a file.
This mimics Python-like 'yield' behavior.
"""
function cat_ch(fpath)
    Channel() do ch
        codec = endswith(fpath, ".gz") ? GzipDecompressorStream : NoopStream
        open(codec, fpath, "r") do stream
            for (i, l) in enumerate(eachline(stream))
                put!(ch, (i, l))
            end
        end
    end
end

function grep_ch(line_chnl, searchstr)
    Channel() do ch
        for (i, l) in line_chnl
            if occursin(searchstr, l)
                put!(ch, (i, l))
            end
        end
    end
end

function catgrep_ch(fpath, search)
    for (i, l) in grep_ch(cat_ch(fpath), search)
        println((i, l))
    end
end

function catgrep(fpath, search)
    codec = endswith(fpath, ".gz") ? GzipDecompressorStream : NoopStream
    open(codec, fpath, "r") do stream
        for (i, l) in enumerate(eachline(stream))
            if occursin(search, l)
                println((i, l))
            end
        end
    end
end

if abspath(PROGRAM_FILE) == @__FILE__
    fpath = ARGS[1]
    search = ARGS[2]
    catgrep_ch(fpath, search)
end
Performance Benchmarks
1) Baseline:
user@computer>> time (cat bigfile.tsv | grep seachterm)
real 0m1.952s
user 0m0.205s
sys 0m2.525s
2) Without Channels (Simple) in Julia:
julia> include("test1.jl")
julia> @time catgrep("bigfile.tsv", "seachterm")
4.448542 seconds (20.30 M allocations: 10.940 GiB, 5.00% gc time)
julia> @time catgrep("bigfile.tsv", "seachterm")
4.512661 seconds (20.30 M allocations: 10.940 GiB, 4.87% gc time)
So the plain loop is already 2-3x slower than the C tools, in the most simplistic possible case. Nothing fancy is done here at all, and the gap is not due to compilation.
3) Channels in Julia:
julia> @time catgrep_ch("bigfile.tsv", "seachterm")
11.691557 seconds (65.45 M allocations: 12.140 GiB, 3.06% gc time, 0.80% compilation time)
julia> @time catgrep_ch("bigfile.tsv", "seachterm")
11.403931 seconds (65.30 M allocations: 12.132 GiB, 3.03% gc time)
This is dramatically slower still, and I am not sure where the overhead comes from.
Is the way in which Channels are used here wrong?
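One variation I suspect matters, but have not benchmarked above: `Channel() do ch ... end` creates an unbuffered `Channel{Any}(0)`, so every `put!`/`take!` is a task rendezvous and every element is boxed. A buffered, concretely typed channel avoids both costs. A sketch of `cat_ch` rewritten that way (gzip handling dropped for brevity; the 10_000-slot buffer is an arbitrary guess):

```julia
# Variant of cat_ch: buffered (10_000 slots) and concretely typed, so
# put!/take! do not synchronize on every line and the (index, line)
# tuples are not boxed as Any.
function cat_ch_buffered(fpath)
    Channel{Tuple{Int,String}}(10_000) do ch
        open(fpath, "r") do stream
            for (i, l) in enumerate(eachline(stream))
                put!(ch, (i, l))
            end
        end
    end
end
```

If someone can confirm whether this (or something else, such as the per-line `println`) accounts for the gap, that would answer the question.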