Expand DataFrame to complete range of values in groupby

Question

I have a DataFrame that contains objects and items belonging to the objects. Items have additional data (not shown) and multiple items can belong to one object.

df = pd.DataFrame(
    {
        "object_id": [1, 1, 1, 1, 1, 2, 2, 2],
        "item_id": [1, 2, 4, 4, 5, 1, 1, 2],
        "item_count": [6, 6, 6, 6, 6, 3, 3, 3],
    }
)

I now want to group by the object_id and extract information from the associated items. While this works, it does not add items that are not already in the DataFrame (i.e. "0" values).

df_group = df.groupby(["object_id", "item_id"], as_index=False).size()

>>> df_group
   object_id  item_id  size
0          1        1     1
1          1        2     1
# e.g. item 3 missing
2          1        4     2
3          1        5     1
4          2        1     2
5          2        2     1

I now wanted to find out if there is a way to expand the groupby given the item_counts. My current naive approach is to create an dataframe list and merge the groupby afterwards:

all_items = [
    dict(object_id=entity, item_id=obj + 1)
    for entity in df["object_id"].unique()
    for obj in range(df.loc[df["object_id"] == entity, "item_count"].iloc[0])
]
df_full = pd.DataFrame(all_items).merge(df_group, how="left").fillna(0).astype({"size": "int"})

>>> df_full
   object_id  item_id  size
0          1        1     1
1          1        2     1
2          1        3     0
3          1        4     2
4          1        5     1
5          1        6     0
6          2        1     2
7          2        2     1
8          2        3     0

What is the output you want exactly? Why does there need to be a row for `object_id=1`, `item_id=6` but not one for `object_id=2`, `item_id=6`? — Matteo Zanoni, Nov 11 '22 at 16:29
The `df_full` is the desired output. I want all `object_id`'s and `item_id`'s listed plus the aggregate function (here for simplicity size) — BBQuercus, Nov 11 '22 at 16:32
Ok, what do you mean with _I want all `object_id`'s and `item_id`'s listed_? In the data you provided `item_id=6` never shows up but you want it in the output? — Matteo Zanoni, Nov 11 '22 at 16:36
Exactly. It doesn't show up but given the `item_count`column it's still there just with 0 occurrences. For subsequent statistics I'll need the 0's too — BBQuercus, Nov 11 '22 at 16:37
Why no `item_id=6` for `object_id=2` then? Sorry for the continuous questions but I'm trying to understand the algorithm — Matteo Zanoni, Nov 11 '22 at 16:40
No problem. Object 2 only has 3 total `item_count` so I only need those three — BBQuercus, Nov 11 '22 at 16:42

Naveed · Accepted Answer · 2022-11-11T17:05:18.243

here is one way to do it

# summarize the duplicate item ids and create a new df
# its needed at this stage to allow us to use reindex later
df2=df.groupby(['object_id','item_id','item_count'], as_index=False).size()

# groupby the object_id then applying lambda on the group, 
# set item id as an index, which is now unique
# reindex the group result with item ids ranging from min of item id
# to the count under item_count column + 1

df3=(df2.groupby('object_id', as_index=False)
 .apply(lambda x: x.set_index(['item_id']).reindex( range(x['item_id'].min(), x['item_count'].max() + 1) ))
)

# null values in size, make them o
df3['size'].fillna(0, inplace=True)

# ffill null values for remaining columns
df3.ffill(inplace=True)

# drop unwanted column after reindex
df3=df3.reset_index().drop(columns='level_0')

# NaN make the column values as float, so turn them back to int
df3=df3[['object_id','item_id','item_count', 'size']].astype(int )
df3

object_id   item_id     item_count  size
0   1   1   6   1
1   1   2   6   1
2   1   3   6   0
3   1   4   6   2
4   1   5   6   1
5   1   6   6   0
6   2   1   3   2
7   2   2   3   1
8   2   3   3   0

score 1 · Answer 2 · answered Nov 11 '22 at 17:01

After you get the df_group you can reindex that group constructing the index by yourself.

The index is constructed by taking all values of object_id and for each of them all values of item_id starting from 1 up to the maximum value of item_count for that object_id.

df.groupby(["object_id", "item_id"]).size().reindex(
    pd.MultiIndex.from_tuples(
        tuples=[
            (object_id, team_id)
            for object_id in df["object_id"].unique()
            for team_id in range(
                1, df[df["object_id"] == object_id]["item_count"].max() + 1
            )
        ],
        names=["object_id", "team_id"],
    ),
    fill_value=0,
)

score -1 · Answer 3 · answered Nov 11 '22 at 16:32

-1

Try this to fill the missing values:

df = (df.set_index('item_id')
        .groupby('object_id')['item_count']
        .apply(lambda x: x.reindex(range(x.index.min(), x.index.max() + 1), fill_value=0))
        .reset_index()
       )

Then do the groupby you need on the new df.

Check the solution here if I missed a detail.

answered Nov 11 '22 at 16:32

Vahid the Great

393
5
18

With his data this code raises `ValueError: cannot reindex on an axis with duplicate labels`. Please check your answer – Matteo Zanoni Nov 11 '22 at 16:34
Please don't just copy and paste another user's solution. If the solution you linked actually worked with minimal modification (like just changing the column names) then you should flag this question as a duplicate, not post a new answer. – Henry Ecker Nov 11 '22 at 16:36

Expand DataFrame to complete range of values in groupby

3 Answers3