1

I'd like to count multiple values (contained in a list per cell) on a groupBy object.

I have the following dataframe:

| | Record the respondent’s sex | 7. What do you use the phone for? | |---|-----------------------------|---------------------------------------------| | 0 | Male | sending texts;calls;receiving sending texts | | 1 | Female | sending texts;calls;WhatsApp;Facebook | | 2 | Male | sending texts;calls;receiving texts | | 3 | Female | sending texts;calls |

I would like to count every value in column 7. What do you use the phone for?, after grouping on Record the respondent’s sex.

I have no problem doing this when there is only one value per cell.

grouped = df.groupby(['Record the respondent’s sex'], sort=True)

question_counts = grouped['2. Are you a teacher, caregiver, or young adult ?'].value_counts(normalize=False, sort=True)

question_data = [
      {'2. Are you a teacher, caregiver, or young adult ?': question, 'Record the respondent’s sex': group, 'count': count*100} for
      (group, question), count in dict(question_counts).items()]

df_question = pd.DataFrame(question_data)

Gives me a table which looks exactly like this:

| 7. What do you use the phone for? | Record the respondent's sex | count | |-----------------------------------|-----------------------------|-------| | sending texts | Male | 2 | | calls | Male | 2 | | receiving texts | Male | 2 | | sending texts | Female | 2 | | calls | Female | 2 | | WhatsApp | Female | 1 | | Facebook | Female | 1 |

If only I could get this working with multiple values!

value_counts() doesn't work on lists with multiple values, it throws an TypeError: unhashable type: 'list' error. The question Counting occurrence of values in a Panda series? shows how to deal with this in various ways, but I can't seem to get it to work on a GroupBy object.

John Boss
  • 55
  • 5
  • Duplicating/exploding the multiple values into rows indeed seems like the simplest and fastest way to go about this (see https://stackoverflow.com/questions/12680754/split-explode-pandas-dataframe-string-entry-to-separate-rows), altough the accepted answer below shows it can also be done without doing that. – John Boss Sep 14 '17 at 11:07

1 Answers1

1
# Initialize sample data.
df = pd.DataFrame({'Record the respondent’s sex': ['Male', 'Female'] * 2, 
                   '7. What do you use the phone for?': [
                       "sending texts;calls;receiving sending texts",
                       "sending texts;calls;WhatsApp;Facebook",
                       "sending texts;calls;receiving texts",
                       "sending texts;calls"
                   ]})

# Split the values on ';' and separate into columns.  Melt the result.
df2 = pd.melt(
    pd.concat([df['Record the respondent’s sex'],
               df.loc[:, "7. What do you use the phone for?"].apply(
                   lambda series: series.split(';')).apply(pd.Series)], axis=1),
    id_vars='Record the respondent’s sex')[['Record the respondent’s sex', 'value']]

# Group on gender and rename columns.
result = df2.groupby('Record the respondent’s sex')['value'].value_counts().reset_index()
result.columns = ['Record the respondent’s sex', '7. What do you use the phone for?', 'count']

# Reorder columns.
>>> result[['7. What do you use the phone for?', 'Record the respondent’s sex', 'count']]
  7. What do you use the phone for? Record the respondent’s sex  count
0                             calls                      Female      2
1                     sending texts                      Female      2
2                          Facebook                      Female      1
3                          WhatsApp                      Female      1
4                             calls                        Male      2
5                     sending texts                        Male      2
6           receiving sending texts                        Male      1
7                   receiving texts                        Male      1
Alexander
  • 105,104
  • 32
  • 201
  • 196
  • I had no idea about pd.melt(), and it seems to do the trick just great. Thanks! – John Boss Sep 10 '17 at 18:22
  • In hindsight, going the route of just creating an extra row per multiple value (which is how MaxU noted this as duplicate) is easier and probably also faster. – John Boss Sep 14 '17 at 11:00