-1

I have collated files of user testing data into a DataFrame in which columns 0 - 2 are their characteristics and the columns from 3 onwards have item codes for the product brand type/model they have tested. I would like to tally the unique item codes tested grouped by the level, age group, and gender to create summaries and keep track of how many times certain models have been tested. The output needs to have the item code name listed as well with how many times it has been tested for that particular user characteristics, i.e. for an elite, 20-40, F: T3 has been tested 12 times.

I have tried using groupby() methods but cannot get the actual item code name listed with its occurrence count number as well.

The output can be a table form, DataFrame, or suggest another efficient method to tackle this question.

Dataframe format:

                  age-
index level       group  gender  0     1     2    ...... 80
1     elite        20-40   F    T3    T4    RR23        T34
2     intermediate <11     M    T56   T44   RR2         T81
3     elite        11-20   M    T45   RR23  T3          RR2
4     beginner     >40     F    RR2   T56   T3          T45
5     advanced     20-40   F    RR22  T4    T45         RR22
6     beginner     11-20   M    T81   T4    T56         T4
Em C
  • 13
  • 2
  • Please post your expected output based on sample input. – Mayank Porwal Jun 13 '20 at 09:39
  • Please provide a small set of sample data as text that we can copy and paste. Include the corresponding desired result. Check out the guide on [how to make good reproducible pandas examples](https://stackoverflow.com/a/20159305/3620003). – timgeb Jun 13 '20 at 09:46

1 Answers1

0

Firstly, please see this post regarding how to produce good pandas examples.

df

          level  group gender     0     1     2     3
0         elite  20-40      F    T3    T4  RR23   T34
1  intermediate    <11      M   T56   T44   RR2   T81
2         elite  11-20      M   T45  RR23    T3   RR2
3      beginner    >40      F   RR2   T56    T3   T45
4      advanced  20-40      F  RR22    T4   T45  RR22
5      beginner  11-20      M   T81    T4   T56    T4
6  intermediate    <11      M   T56   T44   RR2   T34
7         elite  11-20      M   T45  RR23    T3  RR23

Assuming each row corresponds to a person, and the 80 columns (labelled by numbers) are some sort of result from of 80 trials/experiments performed on that person, the data can be reorganized so it can be summarized and counted easily.

import numpy as np
import pandas as pd
from collections import Counter

values = pd.unique(df.loc[:, df.columns[3:]].values.ravel('K'))
cols = df.columns[:3].to_list() + list(values)
df2 = pd.DataFrame(data=[], columns=cols)

counts = df.loc[:, df.columns[3:]].apply(Counter, axis=1)
for i, c in enumerate(counts):
    tmp = pd.DataFrame(dict(c), index=[i])
    df2 = df2.append(tmp, ignore_index=True)
df2.loc[:, df2.columns[:3]] = df.loc[:, df.columns[:3]]
df2 = df2.fillna(0)

Which makes the column names the actual result label, and the value corresponds to the 'count' of that that result label.

          level  group gender  T3  T56  T45  RR2  RR22  T81  T4  T44  RR23  T34
0         elite  20-40      F   1    0    0    0     0    0   1    0     1    1
1  intermediate    <11      M   0    1    0    1     0    1   0    1     0    0
2         elite  11-20      M   1    0    1    1     0    0   0    0     1    0
3      beginner    >40      F   1    1    1    1     0    0   0    0     0    0
4      advanced  20-40      F   0    0    1    0     2    0   1    0     0    0
5      beginner  11-20      M   0    1    0    0     0    1   2    0     0    0
6  intermediate    <11      M   0    1    0    1     0    0   0    1     0    1
7         elite  11-20      M   1    0    1    0     0    0   0    0     2    0

Now you could group by level, group and gender if you wanted to compute some group statistics, like how many 'T3' results exist across the level, group and gender groupings.

df2.groupby(['level', 'group', 'gender'])['T3'].sum()

level         group  gender
advanced      20-40  F         0
beginner      11-20  M         0
              >40    F         1
elite         11-20  M         2
              20-40  F         1
intermediate  <11    M         0
Name: T3, dtype: int64
eNc
  • 1,021
  • 10
  • 23
  • Cheers @eNc for this great response! You did interpret my question right and it looks like this will help me tremendously; I'll work it into my code :) (and work on my question posting formating). – Em C Jun 14 '20 at 03:51