
I'm a complete newbie to Python's Dask (I have a little experience with pandas). I have a large Dask DataFrame (~10 to 20 million rows) that I have to split up based on a unique column value.

For example, if I have the following DataFrame with columns C1 to Cn, I want to create a subset DataFrame for each unique value of the column C2.

Base DataFrame:

| Ind | C1   | C2 | ... | Cn   |
|-----|------|----|-----|------|
| 1   | val1 | AE | ... | time |
| 2   | val2 | FB | ... | time |
| ... | ...  | .. | ... | ...  |
| n   | valn | QK | ... | time |

Subset DataFrames to be created:

Subset 1:

| Ind | C1   | C2 | ... | Cn   |
|-----|------|----|-----|------|
| 1   | val1 | AE | ... | time |
| 2   | val2 | AE | ... | time |
| ... | ...  | .. | ... | ...  |
| n   | valn | AE | ... | time |

Subset 2:

| Ind | C1   | C2 | ... | Cn   |
|-----|------|----|-----|------|
| 1   | val1 | FB | ... | time |
| 2   | val2 | FB | ... | time |
| ... | ...  | .. | ... | ...  |
| n   | valn | FB | ... | time |


and so on.

My current approach is to get all unique values of C2 and filter the base DataFrame for each of these values iteratively, but this takes way too long. I'm currently researching how to improve this process, but I would appreciate it a lot if any of you could give me some tips.
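Roughly, my current approach looks like this (a minimal sketch; the CSV file name is just a placeholder):

import dask.dataframe as dd

# Placeholder input; the real data comes from elsewhere
ddf = dd.read_csv("data.csv")

# Get all unique values of C2, then filter the base frame once per value
unique_vals = ddf["C2"].unique().compute()
subsets = {val: ddf[ddf["C2"] == val] for val in unique_vals}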

pichlbaer

1 Answer


It seems to me that you can obtain the same subsets with `groupby`, both in pandas and in Dask.

import pandas as pd
import dask.dataframe as dd
import numpy as np
import string

# Build a small sample frame with a random two-letter grouping column C2
N = 5
rndm2 = lambda: "".join(np.random.choice(list(string.ascii_lowercase), 2))
df_sample = pd.DataFrame({"C1": np.arange(N),
                          "C2": [rndm2() for i in range(N)],
                          "C3": np.random.randn(N)})

# Duplicate the sample M times so every C2 value appears in several rows
M = 2
df = pd.concat([df_sample for i in range(M)], ignore_index=True)
df["C4"] = np.random.randn(N * M)

Here I'm just printing `list(df.groupby("C2"))[0][1]` to show you what's inside every group:

   C1  C2        C3        C4
3   3  bx  0.668654 -0.237081
8   3  bx  0.668654  0.619883
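
To actually build one sub-DataFrame per unique C2 value, you can iterate over the groups (a short sketch; note that this iteration works on the pandas frame built above, since Dask's groupby does not support iterating over groups):

# One sub-DataFrame per unique value of C2 (pandas)
subsets = {key: group for key, group in df.groupby("C2")}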

If you need the data nicely partitioned on disk, you can do the following:

ddf = dd.from_pandas(df, npartitions=4)
ddf.to_parquet("saved/", partition_on=["C2"])

# You can check that the Parquet files
# end up in separate folders, one per C2 value:
! ls saved/  # If you are on Linux

'C2=iw'  'C2=jl'  'C2=qf'  'C2=wy'  'C2=yr'   _common_metadata
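
As a side note, once the data is partitioned on disk you can read a single group back without touching the others (a sketch; 'iw' is taken from the example output above, and `filters` needs a Parquet engine such as pyarrow or fastparquet):

# Load only the rows whose C2 equals "iw"
one_group = dd.read_parquet("saved/", filters=[("C2", "==", "iw")])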

Now, if you want to perform some computation using these groups, you can apply your function fun with map_partitions, taking care to specify the output meta.

df = dd.read_parquet("saved/")
out = df.map_partitions(fun).compute()  # you should add your output meta
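
For instance, if fun reduced each partition to a single float column, the meta could be spelled out like this (a sketch; fun and its output schema are assumptions, not part of the answer):

# Hypothetical output schema: one float64 column named "result"
meta = pd.DataFrame({"result": pd.Series(dtype="float64")})
out = df.map_partitions(fun, meta=meta).compute()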
rpanai
  • Thank you very much. I will give it a try :) – pichlbaer Jan 11 '19 at 07:11
  • Hey... me again. Your solution was a good start, but I've read here (https://stackoverflow.com/a/31364127/2412796) that the apply() method should be avoided. I now want to set my index to the C2 column, partition the dataframe based on the distinct values, and call map_partitions(...) afterwards. Question 1: Is this a good idea? Question 2: Do you know how to do this partitioning other than getting all unique entries of the column and calling df.repartition(divisions=[myUniqueEntriesList])? – pichlbaer Jan 14 '19 at 17:35
  • Hi @pichlbaer, I guess I'm lost. Do you mind reformulating? I'm not using `apply` in my answer. If you are using the same data several times, it is better to have it nicely partitioned on disk. Using `set_index` could be as expensive as using `apply`. See [dask_performance](http://docs.dask.org/en/latest/dataframe-performance.html). – rpanai Jan 14 '19 at 19:24
  • Hey... sorry for the confusion (it was a long day yesterday). I have to process each partition of the dataframe independently, and a Dask groupby() creates a DataFrameGroupBy object which only allows the apply() method, which should be avoided. So now I want to repartition my dataframe based on the unique column values, but I don't know if this is the best solution. I've read that set_index() is expensive; I'm just trying to find an appropriate starting point at the moment. – pichlbaer Jan 15 '19 at 07:31
  • No problem. If you are just starting, I'd suggest you explore different solutions, such as `set_index` or saving to disk as in the answer, and see which one suits you best. – rpanai Jan 15 '19 at 16:01
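
For completeness, the repartition-by-divisions idea discussed in these comments would look roughly like this (a sketch; it shuffles the data, so it can be as expensive as the alternatives above):

import dask.dataframe as dd

# Assuming ddf is the Dask DataFrame from the answer above
unique_vals = sorted(ddf["C2"].unique().compute())

# Divisions must be sorted; repeating the last value gives
# one partition per unique C2 value
divisions = unique_vals + [unique_vals[-1]]
ddf_by_group = ddf.set_index("C2").repartition(divisions=divisions)

# Now each partition holds exactly one group
out = ddf_by_group.map_partitions(fun).compute()  # add your output meta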