Is it possible to drop some categories when reading partitioned data into a Dask DataFrame?
For example, I have partitioned Parquet data laid out like this:
events/year=2017/month=09/day=01/hour=01/customer=a.com/xxxx.parquet
events/year=2017/month=09/day=01/hour=02/customer=a.com/xxxx.parquet
events/year=2017/month=09/day=01/hour=01/customer=a.com/xxxx.parquet
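To make the layout concrete, here is a small sketch of how I understand the key=value directory names to map to partition columns. It uses only the standard library; `parse_partitions` and its `drop` argument are hypothetical names made up for illustration, and this is not how Dask or pyarrow actually implement partition discovery.

```python
from pathlib import PurePosixPath


def parse_partitions(path, drop=()):
    """Return the key=value directory components of `path` as a dict.

    Hypothetical helper for illustration only; not part of Dask or pyarrow.
    Keys listed in `drop` are skipped.
    """
    partitions = {}
    # Only look at the directories, not the file name itself.
    for part in PurePosixPath(path).parent.parts:
        key, sep, value = part.partition("=")
        # Components without "=" (such as "events") are not partition keys.
        if sep and key not in drop:
            partitions[key] = value
    return partitions


path = "events/year=2017/month=09/day=01/hour=01/customer=a.com/xxxx.parquet"
print(parse_partitions(path))
print(parse_partitions(path, drop={"hour"}))
```

The second call is the behaviour I am after: the hour key is ignored while customer is kept.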
I read it with:
df = dd.read_parquet('./events/24.100/year=*/month=*/day=*/hour=*/customer=*/*.parquet')
After reading, hour and customer are present in my data as categories:
Dask DataFrame Structure:
                   url  referrer  session_id              ts             hour         customer
npartitions=24
                object    object      object  datetime64[ns]  category[known]  category[known]
                   ...       ...         ...             ...              ...              ...
                   ...       ...         ...             ...              ...              ...
Dask Name: read-parquet, 24 tasks
I want to drop hour but keep customer. How do I do that?