
I need to find the frequency of every element in the array while keeping the information about the array shape, because I'll need to iterate over it later on.

I tried this solution as well as this one. It works well for NumPy; however, it doesn't seem to work in Dask because of the limitation that Dask arrays need to know their size for most operations.

import dask.array as da

arr = da.from_array([1, 1, 1, 2, 3, 4, 4])

unique, counts = da.unique(arr, return_counts=True)

print(unique)
# dask.array<getitem, shape=(nan,), dtype=int64, chunksize=(nan,)>

print(counts)
# dask.array<getitem, shape=(nan,), dtype=int64, chunksize=(nan,)>
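For comparison, the equivalent plain NumPy version works as expected, since the result shapes are known immediately (a minimal sketch of what I mean by it working well for NumPy):

import numpy as np

arr = np.array([1, 1, 1, 2, 3, 4, 4])

unique, counts = np.unique(arr, return_counts=True)

print(unique)
# [1 2 3 4]

print(counts)
# [3 1 1 2]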

I am looking for something similar to this:

import dask.array as da

arr = da.from_array([1, 1, 1, 2, 3, 4, 4])

print(da.frequency(arr))
# {1: 3, 2: 1, 3: 1, 4: 2}
mathdugre

2 Answers


I found that the following solution was the fastest for a large amount of data (~37.5 billion elements) with many unique values (>50k).

import dask
import dask.array as da

arr = da.from_array(some_large_array)

bincount = da.bincount(arr)
bincount = bincount[bincount != 0]  # Remove counts for values that never appear in the array
unique = da.unique(arr)

# Compute both so the resulting arrays have known shapes
unique, counts = dask.compute(unique, bincount)
unique = da.from_array(unique)
counts = da.from_array(counts)

frequency = da.transpose(
    da.vstack([unique, counts])
)
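Since the original goal was to iterate over the frequencies later, the stacked result can then be consumed row by row, for example (a small sketch assuming the arrays built above):

# frequency.compute() returns a NumPy array of (value, count) rows
for value, count in frequency.compute():
    print(value, count)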
mathdugre
  • Nice work! You could probably switch from dask.array functions to numpy functions after your call to dask.compute. You probably don't need Dask at this point any longer. http://docs.dask.org/en/latest/best-practices.html#stop-using-dask-when-no-longer-needed – MRocklin May 23 '19 at 14:52
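Following that suggestion, the post-compute steps could be done with plain NumPy instead of wrapping the results back into Dask arrays (a minimal sketch, not benchmarked):

import numpy as np

# After dask.compute, unique and counts are already NumPy arrays,
# so the stacking can happen directly in NumPy.
frequency = np.transpose(np.vstack([unique, counts]))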

Perhaps you can call dask.compute directly after creating the frequency counts. Presumably at this point your dataset is small, so now would be a good time to transition away from Dask Array and back to NumPy.

import dask
import dask.array as da

arr = da.from_array([1, 1, 1, 2, 3, 4, 4])

unique, counts = da.unique(arr, return_counts=True)

unique, counts = dask.compute(unique, counts)
result = dict(zip(unique, counts))
# {1: 3, 2: 1, 3: 1, 4: 2}
MRocklin
  • This works when I try it on a sample of my dataset (~600MB); however, I'm not sure how well it would scale to the full dataset (~75GB). I will definitely give it a shot. Thank you!! – mathdugre May 20 '19 at 14:55
  • I tried the solution you suggested; however, it seems to slow down considerably with large amounts of data and many unique values. I managed to find another solution which is quite fast (see above). I think this might be due to the `return_counts=True` in `da.unique`. Maybe it would require some optimisation; I can open an issue on GitHub if you think it would be useful. – mathdugre May 21 '19 at 20:19