Assuming we have a DataFrame with 4 columns:
import pandas as pd
df_test = pd.DataFrame({
'date': ['2022-12-01', '2022-12-01', '2022-12-01', '2022-12-02', '2022-12-02', '2022-12-02'],
'id': ['id1', 'id1', 'id2', 'id3', 'id4', 'id4'],
'element': ['e1', 'e2', 'e3', 'e4', 'e5', 'e6'],
'related_id': ['rid1', 'rid2', 'rid3', 'rid4', 'rid5', 'rid6'],
})
df_test
This sample DataFrame looks like:

         date   id element related_id
0  2022-12-01  id1      e1       rid1
1  2022-12-01  id1      e2       rid2
2  2022-12-01  id2      e3       rid3
3  2022-12-02  id3      e4       rid4
4  2022-12-02  id4      e5       rid5
5  2022-12-02  id4      e6       rid6
I need to:
- Group by the first 2 columns: `date` and `id`
- Aggregate the remaining columns as:
  - list of `element`s
  - distinct count of `related_id`

(df_test
 .groupby(['date', 'id'], as_index=False)
 .agg(elements=('element', pd.Series.to_list),
      distinct_related_ids=('related_id', 'nunique')))
So far so good, the result is exactly what I was looking for:

         date   id  elements  distinct_related_ids
0  2022-12-01  id1  [e1, e2]                     2
1  2022-12-01  id2      [e3]                     1
2  2022-12-02  id3      [e4]                     1
3  2022-12-02  id4  [e5, e6]                     2
Unfortunately, when I run the exact same code on a much bigger DataFrame, I get the error:
ValueError: Length of values (263128) does not match length of index (8156968)
For reference, the full stack trace:
File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/groupby/generic.py", line 950, in aggregate
self._insert_inaxis_grouper_inplace(result)
File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/groupby/generic.py", line 1485, in _insert_inaxis_grouper_inplace
result.insert(0, name, lev)
File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/frame.py", line 4821, in insert
value = self._sanitize_column(value)
File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/frame.py", line 4915, in _sanitize_column
com.require_length_match(value, self.index)
File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/common.py", line 571, in require_length_match
raise ValueError(
ValueError: Length of values (263128) does not match length of index (8156968)
What is going on here and how can I debug or rewrite this implementation to work with a bigger DataFrame?
EDIT
If I split that big DataFrame into single days, the `groupby` and `agg` combination works as expected. The error manifests only when multiple days are involved.
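For reference, the day-by-day split that works can be sketched as below (shown on the small `df_test` from the top of the question, since `df_big` itself is too large to include here):

```python
import pandas as pd

# Small stand-in for the real data; the per-day split is the point here.
df_test = pd.DataFrame({
    'date': ['2022-12-01', '2022-12-01', '2022-12-01',
             '2022-12-02', '2022-12-02', '2022-12-02'],
    'id': ['id1', 'id1', 'id2', 'id3', 'id4', 'id4'],
    'element': ['e1', 'e2', 'e3', 'e4', 'e5', 'e6'],
    'related_id': ['rid1', 'rid2', 'rid3', 'rid4', 'rid5', 'rid6'],
})

# Aggregate each day separately, then stitch the partial results together.
parts = [
    day_df.groupby(['date', 'id'], as_index=False)
          .agg(elements=('element', pd.Series.to_list),
               distinct_related_ids=('related_id', 'nunique'))
    for _, day_df in df_test.groupby('date')
]
result = pd.concat(parts, ignore_index=True)
print(result)
```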
EDIT 2
While bisecting the "big DataFrame" looking for a smaller example, as suggested by @mozway in the comments, I noticed that the problem seems related to `groupby` itself, since swapping the `agg` for a plain `nunique` gives the same error, with the same details:
(df_big
.groupby(['date', 'id'], as_index=False)
.nunique())
ValueError: Length of values (263128) does not match length of index (8156968)
Where `263128` is the number of unique `id`s across the whole multi-day `df_big`.
The same error occurs when using `query` to select a single day:
import datetime
(df_big
.query('date == @datetime.date(2022, 12, 1)')
.groupby(['date', 'id'], as_index=False)
.nunique())
ValueError: Length of values (8682) does not match length of index (263128)
Where `8682` is the number of unique `id`s for that day:
df_big.query('date == @datetime.date(2022, 12, 1)')['id'].nunique()
EDIT 3
In an attempt to rule out the assumption that this problem is related to the size of the DataFrame, I followed the suggestion from the first link mentioned in the comments by @wjandrea and generated many synthetic DataFrames:
import numpy as np
rows = 10_000_000
np.random.seed()
df_test = pd.DataFrame({
'date': np.random.choice(pd.date_range('2022-12-01', periods=31, freq='D'), rows),
'id': np.random.choice(range(100), rows),
'element': np.random.randn(rows),
'related_id': np.random.choice(range(10), rows),
})
df_test_compressed = df_test.groupby(['date', 'id'], as_index=False).agg(
    elements=('element', pd.Series.to_list),
    distinct_related_ids=('related_id', 'nunique'))
df_test_compressed['elements'].map(len).value_counts()
Unfortunately, not a single one of the generated DataFrames raised that `ValueError`.
EDIT 4
"minimal reproducible example" with only 2 rows:
import pandas as pd
import numpy as np
rows = 2
np.random.seed(42)
df = pd.DataFrame({
'date': np.random.choice(pd.date_range('2022-12-01', periods=31, freq='D'), rows),
'id': np.random.choice(range(100), rows),
'element': np.random.randn(rows),
'related_id': np.random.choice(range(10), rows),
}).astype({
'id': 'category',
'element': 'category',
'related_id': 'category',
})
Grouping and aggregating:

(df
 .groupby(['date', 'id'], as_index=False)
 .agg(elements=('element', pd.Series.to_list),
      distinct_related_ids=('related_id', 'nunique')))

triggers the error:
ValueError: Length of values (2) does not match length of index (4)
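One pattern in the numbers is worth noting (a hypothesis read off the error messages, not a confirmed diagnosis): in every case the reported index length equals the number of distinct dates times the total number of `id` categories, i.e. the full cross product that `groupby` materializes for categorical keys when `observed=False` (the default), while the values length is the number of observed groups:

```python
# Index lengths quoted in the ValueErrors above, rewritten as
# (number of dates) * (number of id categories):

# full df_big: 31 days of data (which would match a full month of December)
# x 263128 id categories
assert 31 * 263128 == 8156968

# single-day query: 1 day x 263128 id categories (vs 8682 observed ids)
assert 1 * 263128 == 263128

# minimal example: 2 dates x 2 id categories (vs 2 observed rows)
assert 2 * 2 == 4
```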
EDIT 5
As of March 2023, this is still an unresolved bug.
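A workaround that appears to sidestep the error (assuming the root cause is the categorical group keys, as the minimal example suggests) is to pass `observed=True` to `groupby`, so that only category combinations actually present in the data are produced, or to cast the key columns away from `category` dtype before grouping. A sketch, reusing the shape of the minimal example but with hard-coded values for reproducibility:

```python
import pandas as pd

# Same shape as the minimal example above, but with explicit values
# (the particular random draws from seed 42 are not essential to the bug).
df = pd.DataFrame({
    'date': pd.to_datetime(['2022-12-05', '2022-12-20']),
    'id': [51, 92],
    'element': [0.25, -0.75],
    'related_id': [3, 7],
}).astype({
    'id': 'category',
    'element': 'category',
    'related_id': 'category',
})

# observed=True restricts the result to (date, id) combinations that
# actually occur, instead of the full cross product of categories.
result = (df
          .groupby(['date', 'id'], as_index=False, observed=True)
          .agg(elements=('element', pd.Series.to_list),
               distinct_related_ids=('related_id', 'nunique')))
print(result)
```

Alternatively, casting the key column first, e.g. `df.astype({'id': str})`, should avoid the unobserved-categories expansion entirely.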