0

I have a data frame that looks like this:

screenshot of table - input

I want to create a matrix that will count the number of times each time per 'ID', 'col2' and 'col3' says a fruit value:

screenshot of table - output

wjandrea
  • 28,235
  • 9
  • 60
  • 81
user
  • 25
  • 4
  • [Please don't post pictures of text](https://meta.stackoverflow.com/q/285551/4518341). Instead, copy the text itself, [edit] it into your post, and use the formatting tools like [code formatting](/editing-help#code) or table formatting. It also wouldn't hurt to provide code that creates this df; see [How to make good reproducible pandas examples](/q/20109391/4518341). – wjandrea Apr 26 '22 at 19:50

2 Answers2

2

One way (vectorized):

df = df.set_index('ID')
new_df = pd.DataFrame(np.sum(df.to_numpy()[:, None] == np.unique(df.to_numpy())[:, None], axis=2), index=df.index, columns=np.unique(df.to_numpy()))

Output:

>>> new_df
     Apple  Orange  Pear
ID                      
001      0       2     0
002      1       0     1
003      1       0     1

If you want to operate on only a subset of the columns:

subset = ['col2', 'col3']
new_df = pd.DataFrame(np.sum(df[subset].to_numpy()[:, None] == np.unique(df[subset].to_numpy())[:, None], axis=2), index=df.index, columns=np.unique(df[subset].to_numpy()))
  • what if the data frame contains more columns, but I only want to look at col 2 and col 3 – user Apr 26 '22 at 19:55
1

You can do:

df.set_index('ID', inplace=True)
dicts = [df.loc[idx].value_counts().to_dict() for idx in df.index]
df2 = pd.DataFrame(dicts, index=df.index).fillna(0).astype(int)

Output df2:

     Orange   Pear   Apple
ID          
001       2      0       0
002       0      1       1
003       0      1       1
SomeDude
  • 13,876
  • 5
  • 21
  • 44