I have a GroupedDataFrame in Julia 1.4 (DataFrames 0.22.1). I want to iterate over the groups of rows to compute some statistics. Because there are many groups and the computations are slow, I want to run the loop on multiple threads.

The code

grouped_rows = groupby(data, by_index)
for group in grouped_rows
    # do something with `group`
end

works, but

grouped_rows = groupby(data, by_index)
Threads.@threads for group in grouped_rows
    # do something with `group`
end

results in `MethodError: no method matching firstindex(::GroupedDataFrame{DataFrame})`. Is there a way to parallelize the iteration over groups of DataFrame rows?

Miklós Koren

1 Answer

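Threads.@threads needs an indexable collection such as an AbstractVector: it splits the iteration range by index, which is why it calls firstindex on the GroupedDataFrame and fails.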

Hence, collect your grouped_rows into a vector first:

Threads.@threads for group in collect(SubDataFrame, grouped_rows)
    # do something with `group`
end
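
To get results back out of the threaded loop, one option is to preallocate an output vector and have each iteration write only to its own slot. The following is a minimal sketch, not a definitive implementation: the toy data, the column names `id` and `x`, and the per-group mean standing in for the slow computation are all assumptions for illustration. Julia must be started with more than one thread (for example via the `JULIA_NUM_THREADS` environment variable) for the loop to actually run in parallel.

```julia
using DataFrames
using Statistics

# Hypothetical toy data; `id` and `x` are illustrative column names.
data = DataFrame(id = repeat(1:4, inner = 3), x = collect(1.0:12.0))

grouped_rows = groupby(data, :id)

# Materialize the groups into a Vector so that @threads can index into it.
groups = collect(SubDataFrame, grouped_rows)

# One slot per group; each iteration writes only to its own index,
# so no locking is needed.
results = Vector{Float64}(undef, length(groups))

Threads.@threads for i in eachindex(groups)
    # Stand-in for the slow per-group computation.
    results[i] = mean(groups[i].x)
end
```

Looping over `eachindex(groups)` rather than over the groups themselves keeps each result aligned with its group without shared mutable state such as `push!` on a common vector, which would not be thread-safe.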
Przemyslaw Szufel
  • @korenmiklos FWIW, you should update to DataFrames.jl v1.0. If you start Julia with multiple threads, the grouped operations will run in threads by default. – Florian Oswald May 04 '21 at 10:24
  • @FlorianOswald but as I understand it, the built-in `DataFrames.jl` operations get parallelized; for your custom processing you still need to handle parallelism yourself. Or did you mean something else? – Przemyslaw Szufel May 04 '21 at 12:02
  • Oh no, that's exactly what I meant. What I am saying is that if `do something with group` is, for example, `combine`, then the new DataFrames.jl will run it on threads (I didn't know that until today!). Otherwise, it's like you say! – Florian Oswald May 04 '21 at 16:54
  • Yes! Lots of built-in DataFrames.jl APIs got multithreaded parallelism very recently, which is very nice. – Przemyslaw Szufel May 04 '21 at 17:01