Is it possible to drop some categories when reading partitioned data into Dask DataFrame?

For example, I have partitioned Parquet data laid out like this:

events/year=2017/month=09/day=01/hour=01/customer=a.com/xxxx.parquet
events/year=2017/month=09/day=01/hour=02/customer=a.com/xxxx.parquet
events/year=2017/month=09/day=01/hour=01/customer=a.com/xxxx.parquet

I read it with:

import dask.dataframe as dd
df = dd.read_parquet('./events/24.100/year=*/month=*/day=*/hour=*/customer=*/*.parquet')

After reading, hour and customer are present in my data as categories:

Dask DataFrame Structure:
                   url referrer session_id              ts             hour         customer
npartitions=24
                object   object     object  datetime64[ns]  category[known]  category[known]
                   ...      ...        ...             ...              ...              ...
                   ...      ...        ...             ...              ...              ...
Dask Name: read-parquet, 24 tasks

I want to drop hour but keep customer. How do I do that?

j-bennet
It was not a duplicate, because `df.drop` did not work for categorical columns. But I just tried again with the latest `dask` from master, and it works. Now it really is a duplicate. – j-bennet Mar 27 '18 at 04:41
