1

I have a DataFrame that contains objects and items belonging to the objects. Items have additional data (not shown) and multiple items can belong to one object.

df = pd.DataFrame(
    {
        "object_id": [1, 1, 1, 1, 1, 2, 2, 2],
        "item_id": [1, 2, 4, 4, 5, 1, 1, 2],
        "item_count": [6, 6, 6, 6, 6, 3, 3, 3],
    }
)

I now want to group by the object_id and extract information from the associated items. While this works, it does not add items that are not already in the DataFrame (i.e. "0" values).

df_group = df.groupby(["object_id", "item_id"], as_index=False).size()

>>> df_group
   object_id  item_id  size
0          1        1     1
1          1        2     1
# e.g. item 3 missing
2          1        4     2
3          1        5     1
4          2        1     2
5          2        2     1

I now wanted to find out if there is a way to expand the groupby given the item_counts. My current naive approach is to create an dataframe list and merge the groupby afterwards:

all_items = [
    dict(object_id=entity, item_id=obj + 1)
    for entity in df["object_id"].unique()
    for obj in range(df.loc[df["object_id"] == entity, "item_count"].iloc[0])
]
df_full = pd.DataFrame(all_items).merge(df_group, how="left").fillna(0).astype({"size": "int"})

>>> df_full
   object_id  item_id  size
0          1        1     1
1          1        2     1
2          1        3     0
3          1        4     2
4          1        5     1
5          1        6     0
6          2        1     2
7          2        2     1
8          2        3     0
Henry Ecker
  • 34,399
  • 18
  • 41
  • 57
BBQuercus
  • 819
  • 1
  • 11
  • 28
  • What is the output you want exactly? Why does there need to be a row for `object_id=1`, `item_id=6` but not one for `object_id=2`, `item_id=6`? – Matteo Zanoni Nov 11 '22 at 16:29
  • The `df_full` is the desired output. I want all `object_id`'s and `item_id`'s listed plus the aggregate function (here for simplicity size) – BBQuercus Nov 11 '22 at 16:32
  • Ok, what do you mean with _I want all `object_id`'s and `item_id`'s listed_? In the data you provided `item_id=6` never shows up but you want it in the output? – Matteo Zanoni Nov 11 '22 at 16:36
  • Exactly. It doesn't show up but given the `item_count`column it's still there just with 0 occurrences. For subsequent statistics I'll need the 0's too – BBQuercus Nov 11 '22 at 16:37
  • Why no `item_id=6` for `object_id=2` then? Sorry for the continuous questions but I'm trying to understand the algorithm – Matteo Zanoni Nov 11 '22 at 16:40
  • No problem. Object 2 only has 3 total `item_count` so I only need those three – BBQuercus Nov 11 '22 at 16:42

3 Answers3

1

here is one way to do it

# summarize the duplicate item ids and create a new df
# its needed at this stage to allow us to use reindex later
df2=df.groupby(['object_id','item_id','item_count'], as_index=False).size()

# groupby the object_id then applying lambda on the group, 
# set item id as an index, which is now unique
# reindex the group result with item ids ranging from min of item id
# to the count under item_count column + 1

df3=(df2.groupby('object_id', as_index=False)
 .apply(lambda x: x.set_index(['item_id']).reindex( range(x['item_id'].min(), x['item_count'].max() + 1) ))
)

# null values in size, make them o
df3['size'].fillna(0, inplace=True)

# ffill null values for remaining columns
df3.ffill(inplace=True)

# drop unwanted column after reindex
df3=df3.reset_index().drop(columns='level_0')

# NaN make the column values as float, so turn them back to int
df3=df3[['object_id','item_id','item_count', 'size']].astype(int )
df3

object_id   item_id     item_count  size
0   1   1   6   1
1   1   2   6   1
2   1   3   6   0
3   1   4   6   2
4   1   5   6   1
5   1   6   6   0
6   2   1   3   2
7   2   2   3   1
8   2   3   3   0
Naveed
  • 11,495
  • 2
  • 14
  • 21
1

After you get the df_group you can reindex that group constructing the index by yourself.

The index is constructed by taking all values of object_id and for each of them all values of item_id starting from 1 up to the maximum value of item_count for that object_id.

df.groupby(["object_id", "item_id"]).size().reindex(
    pd.MultiIndex.from_tuples(
        tuples=[
            (object_id, team_id)
            for object_id in df["object_id"].unique()
            for team_id in range(
                1, df[df["object_id"] == object_id]["item_count"].max() + 1
            )
        ],
        names=["object_id", "team_id"],
    ),
    fill_value=0,
)
Matteo Zanoni
  • 3,429
  • 9
  • 27
-1

Try this to fill the missing values:

df = (df.set_index('item_id')
        .groupby('object_id')['item_count']
        .apply(lambda x: x.reindex(range(x.index.min(), x.index.max() + 1), fill_value=0))
        .reset_index()
       )

Then do the groupby you need on the new df.

Check the solution here if I missed a detail.

Vahid the Great
  • 393
  • 5
  • 18
  • With his data this code raises `ValueError: cannot reindex on an axis with duplicate labels`. Please check your answer – Matteo Zanoni Nov 11 '22 at 16:34
  • Please don't just copy and paste another user's solution. If the solution you linked actually worked with minimal modification (like just changing the column names) then you should flag this question as a duplicate, not post a new answer. – Henry Ecker Nov 11 '22 at 16:36