0

I have a pandas dataframe with texts each of which can belong to one genre and to several categories each. As each text can belong to several categories, the respective columns are one-hot encoded.

Here is an example (the actual dataframe has a lot more categories):

df = pd.DataFrame({'text':{0:'This is an example string', 1: 'this is another example', 2:'and another',3:'and yet another example'},'genre':{0: 'fiction', 1: 'fiction', 2: 'scientific', 3: 'news'},'category_nature':{0: 1, 1: 1, 2: 0, 3:1}, 'category_history':{0: 1, 1: 0, 2: 0, 3:1},'category_art':{0: 0, 1: 0, 2: 1, 3: 0}})

I'm looking for a way to get something like value_counts() on the categories but also on the genre, like this: enter image description here

I first tried to change the format of the one-hot-encoded columns, but then I lose the "genre" column.

df_new = df.drop(columns=['text','genre']);
count = df_new.sum().sort_values(ascending=False)

I also checked the following post, but it wasn't exactly what I was looking for.

Python: get a frequency count based on two columns (variables) in pandas dataframe some row appers

Value_counts on multiple columns with groupby

Jonathan Hall
  • 75,165
  • 16
  • 143
  • 189

2 Answers2

0

You can melt and groupby:

(df.melt(id_vars=['text', 'genre'], var_name='category', value_name='count')
   .groupby(['genre', 'category'])
   ['count'].sum()
   # below is for formatting only
   .reset_index()
   .query('count > 0')
   .assign(category=lambda d:d['category'].str[9:])
)

output:

        genre category  count
1     fiction  history      1
2     fiction   nature      2
4        news  history      1
5        news   nature      1
6  scientific      art      1
mozway
  • 194,879
  • 13
  • 39
  • 75
0

Is this what you were after?

df.groupby('genre').sum().reset_index().melt(id_vars='genre')


        genre          variable  value
0     fiction   category_nature      2
1        news   category_nature      1
2  scientific   category_nature      0
3     fiction  category_history      1
4        news  category_history      1
5  scientific  category_history      0
6     fiction      category_art      0
7        news      category_art      0
8  scientific      category_art      1
Quixotic22
  • 2,894
  • 1
  • 6
  • 14