ValueError: Not all divisions are known, can't align partitions error on dask dataframe

Question

I have a pandas dataframe with the following columns:

user_id user_agent_id requests

All columns contain integers. I want to perform some operations on them and run them using a dask dataframe. This is what I do:

user_profile = cache_records_dataframe[['user_id', 'user_agent_id', 'requests']] \
    .groupby(['user_id', 'user_agent_id']) \
    .size().to_frame(name='appearances') \
    .reset_index() # I am not sure I can run this on dask dataframe

import dask.dataframe as df  # assumed: `df` is the alias used for dask.dataframe

user_profile_ddf = df.from_pandas(user_profile, npartitions=4)
user_profile_ddf['percent'] = user_profile_ddf.groupby('user_id')['appearances'] \
    .apply(lambda x: x / x.sum(), meta=float)  # share of appearances within each user group

But I get the following error:

raise ValueError("Not all divisions are known, can't align "
ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.

Am I doing something wrong? In pure pandas it works fine, but it gets slow for many rows (although they fit in memory), so I want to parallelize the computations.

I've seen this issue, but how does it help? I tried map_partitions and it still didn't work. The issue also looks closed. – Apostolos, Jul 11 '17 at 09:42

score 1 · Answer 1 · answered May 14 '19 at 20:54

When creating the dask dataframe, add a reset_index() call:

user_profile_ddf = df.from_pandas(user_profile, npartitions=4).reset_index()

skibee

Hey, I tried this, but it did not work. Can you help me with https://stackoverflow.com/questions/72903541/converting-timestamp-into-proper-format-with-dask-in-python – Coder Jul 08 '22 at 14:41
