
Assuming we have a DataFrame with 4 columns:

import pandas as pd

df_test = pd.DataFrame({
    'date': ['2022-12-01', '2022-12-01', '2022-12-01', '2022-12-02', '2022-12-02', '2022-12-02'],
    'id': ['id1', 'id1', 'id2', 'id3', 'id4', 'id4'],
    'element': ['e1', 'e2', 'e3', 'e4', 'e5', 'e6'],
    'related_id': ['rid1', 'rid2', 'rid3', 'rid4', 'rid5', 'rid6'],
})
df_test

This sample DataFrame looks like:

         date   id element related_id
0  2022-12-01  id1      e1       rid1
1  2022-12-01  id1      e2       rid2
2  2022-12-01  id2      e3       rid3
3  2022-12-02  id3      e4       rid4
4  2022-12-02  id4      e5       rid5
5  2022-12-02  id4      e6       rid6

I need to:

  • Group by the first 2 columns: date and id
  • Aggregate the remaining columns as:
    • list of elements
    • distinct count of related_id

(df_test
 .groupby(['date', 'id'], as_index=False)
 .agg(elements=('element', pd.Series.to_list),
      distinct_related_ids=('related_id', 'nunique')))

So far so good, the result is exactly what I was looking for:

         date   id  elements  distinct_related_ids
0  2022-12-01  id1  [e1, e2]                     2
1  2022-12-01  id2      [e3]                     1
2  2022-12-02  id3      [e4]                     1
3  2022-12-02  id4  [e5, e6]                     2

Unfortunately, when I run the exact same code on a much bigger DataFrame, I get this error:

ValueError: Length of values (263128) does not match length of index (8156968)

For reference, the details from the stack trace:

  File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/groupby/generic.py", line 950, in aggregate
    self._insert_inaxis_grouper_inplace(result)
  File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/groupby/generic.py", line 1485, in _insert_inaxis_grouper_inplace
    result.insert(0, name, lev)
  File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/frame.py", line 4821, in insert
    value = self._sanitize_column(value)
  File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/frame.py", line 4915, in _sanitize_column
    com.require_length_match(value, self.index)
  File "/Users/fvitale/.pyenv/versions/adversary/lib/python3.10/site-packages/pandas/core/common.py", line 571, in require_length_match
    raise ValueError(
ValueError: Length of values (263128) does not match length of index (8156968)

What is going on here and how can I debug or rewrite this implementation to work with a bigger DataFrame?
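One quick first check in situations like this is to inspect the dtypes of the grouping columns, since some dtypes (notably category) change groupby semantics. A minimal sketch with illustrative names (df_small and df_cat are not from the question):

```python
import pandas as pd

# Two toy frames that differ only in the dtype of 'id'.
df_small = pd.DataFrame({'id': ['a', 'b']})
df_cat = df_small.astype({'id': 'category'})

# Comparing dtypes reveals an unexpected categorical column.
print(df_small.dtypes)  # 'id' is object
print(df_cat.dtypes)    # 'id' is category
```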


EDIT

If I split that big DataFrame into single days, that groupby and agg combination works as expected. The error manifests itself only when multiple days are involved.


EDIT 2

While bisecting the "big DataFrame" in search of a smaller example, as suggested by @mozway in the comments, I noticed that the problem seems related to groupby itself: replacing the agg with a plain nunique raises the same error, with the same details:

(df_big
.groupby(['date', 'id'], as_index=False)
.nunique())
ValueError: Length of values (263128) does not match length of index (8156968)

Where 263128 is the number of unique id values across the whole multi-day df_big.

The same error occurs when selecting a single day with query:

import datetime

(df_big
.query('date == @datetime.date(2022, 12, 1)')
.groupby(['date', 'id'], as_index=False)
.nunique())
ValueError: Length of values (8682) does not match length of index (263128)

Where 8682 is the number of unique ids for that day:

df_big.query('date == @datetime.date(2022, 12, 1)')['id'].nunique()

EDIT 3

To rule out the assumption that this problem is related to the size of the DataFrame, I followed the suggestion from the first link mentioned by @wjandrea in the comments and generated many synthetic DataFrames:

import numpy as np

rows = 10_000_000
np.random.seed()

df_test = pd.DataFrame({
    'date': np.random.choice(pd.date_range('2022-12-01', periods=31, freq='D'), rows),
    'id': np.random.choice(range(100), rows),
    'element': np.random.randn(rows),
    'related_id': np.random.choice(range(10), rows),
})
df_test_compressed = (df_test
    .groupby(['date', 'id'], as_index=False)
    .agg(elements=('element', pd.Series.to_list),
         distinct_related_ids=('related_id', 'nunique')))
df_test_compressed['elements'].map(len).value_counts()

Unfortunately, not a single tested instance raised that ValueError.


EDIT 4

"minimal reproducible example" with only 2 rows:

import pandas as pd
import numpy as np


rows = 2
np.random.seed(42)

df = pd.DataFrame({
    'date': np.random.choice(pd.date_range('2022-12-01', periods=31, freq='D'), rows),
    'id': np.random.choice(range(100), rows),
    'element': np.random.randn(rows),
    'related_id': np.random.choice(range(10), rows),
}).astype({
    'id': 'category',
    'element': 'category',
    'related_id': 'category',
})

Group by and aggregating:

(df
 .groupby(['date', 'id'], as_index=False)
 .agg(elements=('element', pd.Series.to_list),
      distinct_related_ids=('related_id', 'nunique')))

This triggers the error:

ValueError: Length of values (2) does not match length of index (4)

EDIT 5

As of March 2023, this is still an unresolved bug.

  • can you try to split your dataframe in two, check if either part raises the error, and start again until you get a minimal example that reproduces it? – mozway Mar 09 '23 at 15:18
  • @mozway I updated the question mentioning that the small part (single day) are working as expected – Filippo Vitale Mar 09 '23 at 15:22
  • No, I meant, can you try to generate a minimal example that reproduces the error? If you split in two (randomly, not by days) this doesn't raise the error for either part? Can you try to provide the full dataset in another way? – mozway Mar 09 '23 at 15:30
  • See [How to make good reproducible pandas examples](/q/20109391/4518341), and [mre] in general. – wjandrea Mar 09 '23 at 15:53
  • Dear @wjandrea I have been keep trying to create a "minimal reproducible example" that was between the one I put in this question and the DataFrame I have at hands but I still haven't found it. I thought that was clear from the text of my question that I am still searching for it. – Filippo Vitale Mar 09 '23 at 16:07
  • @FilippoVitale It's clear, I'm just offering advice. – wjandrea Mar 09 '23 at 16:14
  • @wjandrea thanks for that, I tried one of the suggestions from the first link you posted here in the comments and updated the question. – Filippo Vitale Mar 09 '23 at 17:33
  • Beside the point, but I don't think `np.random.seed()` really does anything there. The point of it in the link is to pass in a seed so that we would have the same "random" data you do. – wjandrea Mar 09 '23 at 18:12
  • @wjandrea I have been cycling through different seed numbers to generate a "reproducible example". Forgot to remove that line when editing the question. – Filippo Vitale Mar 09 '23 at 18:16
  • @Filippo can you provide the original DataFrame that triggers the error? – mozway Mar 09 '23 at 18:43
  • @mozway thank you for asking, but I found the culprit! Posting an answer very soon. Again thank you for all your comments – Filippo Vitale Mar 09 '23 at 18:45
  • Thanks for the feedback, I had the feeling a bug was involved but it was hard to know without the data ;) – mozway Mar 10 '23 at 05:16

1 Answer

The problem here is the category dtype of one of the grouping columns, id in this case: when grouping by categorical columns, pandas by default builds a group for every category (and every combination of categories), whether observed in the data or not.
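This can be seen on a tiny frame: grouping by a categorical with an unused category produces a group for that category too, so the grouped result is longer than the number of observed keys. A minimal sketch:

```python
import pandas as pd

# 'id' has three categories, but only 'a' and 'b' occur in the data.
df = pd.DataFrame({
    'id': pd.Categorical(['a', 'b'], categories=['a', 'b', 'c']),
    'x': [1, 2],
})

# With observed=False (the long-time default for categorical groupers),
# the unused category 'c' still gets a group: 3 rows, not 2.
out = df.groupby('id', observed=False)['x'].sum()
print(len(out))  # 3
```

With two categorical grouping columns, this expansion becomes a cross product of all category combinations, which is consistent with the mismatched lengths reported in the error messages above.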

One way to avoid the error is to convert the id column to a non-categorical dtype, for instance string:

(df.astype({'id': 'string'})
 .groupby(['date', 'id'], as_index=False)
 .agg(elements=('element', pd.Series.to_list),
      distinct_related_ids=('related_id', 'nunique')))
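Alternatively, a sketch that keeps the category dtype but tells groupby to only build groups for key combinations that actually occur in the data, via observed=True (this assumes a pandas version where named aggregation works together with observed=True):

```python
import pandas as pd
import numpy as np

# Rebuild the 2-row example from EDIT 4.
rows = 2
np.random.seed(42)

df = pd.DataFrame({
    'date': np.random.choice(pd.date_range('2022-12-01', periods=31, freq='D'), rows),
    'id': np.random.choice(range(100), rows),
    'element': np.random.randn(rows),
    'related_id': np.random.choice(range(10), rows),
}).astype({
    'id': 'category',
    'element': 'category',
    'related_id': 'category',
})

# observed=True restricts the grouping to (date, id) pairs that
# actually occur, avoiding the cross product of all categories.
out = (df
       .groupby(['date', 'id'], as_index=False, observed=True)
       .agg(elements=('element', pd.Series.to_list),
            distinct_related_ids=('related_id', 'nunique')))
```

This also tends to be much cheaper on high-cardinality categoricals, since no empty groups are materialized.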