0

I have data with an id and a value, like this:

df = pd.DataFrame({
    "id": [1,1,1,1,2,2,2,3,3],
    "value": ["a", "a", "a", "b", "a", "b", "c", "c", "c"]
})
df

enter image description here

As seen above, the id can occure multiple times. I want to aggregate the data from each id into one row, to get something like this:

enter image description here

I have code that accomplishes that in two steps. First I add the count to each row:

counts = df.groupby("id")["value"].value_counts()
def get_counts(x, instance):
    try:
        return counts.loc[x["id"], instance]
    except:
        return 0

for i in df["value"].unique():
    df["value_count_" + i] = df.apply(lambda x: get_counts(x, i), axis=1)
df

enter image description here

Then I drop the value column and then the duplicates:

df.drop(columns="value").drop_duplicates()

enter image description here

However, although this works, it takes a lot of time for larger datasets and I'm sure there is a more efficient way to get the counts in the first step.

Any ideas?

Thanks in advance!

Julian
  • 591
  • 5
  • 14
  • 3
    crosstab? `pd.crosstab(df['id'],df['value']).add_prefix("value_counts_").reset_index()` or even pivot table with `aggfunc='size` – anky Aug 27 '20 at 15:27
  • https://stackoverflow.com/questions/47152691/how-to-pivot-a-dataframe refer question 9 – anky Aug 27 '20 at 15:30

0 Answers0