1

I have a dataframe with the following structure:

df = pd.DataFrame({'TIME':list('12121212'),'NAME':list('aabbccdd'), 'CLASS':list("AAAABBBB"),
               'GRADE':[4,5,4,5,4,5,4,5]}, columns = ['TIME', 'NAME', 'CLASS','GRADE'])
print(df):
  TIME NAME CLASS  GRADE
0    1    a     A      4
1    2    a     A      5
2    1    b     A      4
3    2    b     A      5
4    1    c     B      4
5    2    c     B      5
6    1    d     B      4
7    2    d     B      5

What I need to do is split the above dataframe into multiple dataframes according to the variable CLASS, convert the dataframe from long to wide (such that we have NAMES as columns and GRADE as the main entry in the data matrix) and then iterate other functions over the smaller CLASS dataframes. If I create a dict object as suggested here, I obtain:

d = dict(tuple(df.groupby('CLASS')))
print(d):
{'A':   TIME NAME CLASS  GRADE
0    1    a     A      4
1    2    a     A      5
2    1    b     A      4
3    2    b     A      5, 'B':   TIME NAME CLASS  GRADE
4    1    c     B      4
5    2    c     B      5
6    1    d     B      4
7    2    d     B      5}

In order to convert the dataframe from long to wide, I used the function pivot_table from pandas:

for names, classes in d.items():
    newdata=df.pivot_table(index="TIME", columns="NAME", values="GRADE")
print(newdata):
NAME  a  b  c  d
TIME
1     4  4  4  4
2     5  5  5  5

So far so good. However, once I obtain the newdata dataframe I am not able to access the smaller dataframes created in d, since the variable CLASS is now missing from the dataframe (as it should be). Suppose I then need to iterate a function over the two smaller subframes CLASS==A and CLASS==B. How would I be able to do this using a for loop if I am not able to define the dataset structure using the column CLASS?

  • newdata you are printing outside the for loop, if you indent it and print inside the loop, you should get individual data frames for each class – Shreyas Prakash Sep 14 '21 at 08:20
  • Please trim your code to make it easier to find your problem. Follow these guidelines to create a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). – Community Sep 20 '21 at 11:30

1 Answers1

1

Try using groupby+apply to conserve the group names:

(df.groupby('CLASS')
   .apply(lambda d: d.pivot_table(index="TIME", columns="NAME", values="GRADE"))
)

output:

              a    b    c    d
CLASS TIME                    
A     1     4.0  4.0  NaN  NaN
      2     5.0  5.0  NaN  NaN
B     1     NaN  NaN  4.0  4.0
      2     NaN  NaN  5.0  5.0

Other possibility, loop over the groups, keeping CLASS as column:

for group_name, group_df in df.groupby('CLASS', as_index=False):
    print(f'working on group {group_name}')
    print(group_df)

output:

working on group A
  TIME NAME CLASS  GRADE
0    1    a     A      4
1    2    a     A      5
2    1    b     A      4
3    2    b     A      5
working on group B
  TIME NAME CLASS  GRADE
4    1    c     B      4
5    2    c     B      5
6    1    d     B      4
7    2    d     B      5
mozway
  • 194,879
  • 13
  • 39
  • 75
  • This seems like I nice solution! How would I be able to loop functions over the subgroups though? And is there a way to retrieve the datasets A and B on their own? – Michael J. Erickson Sep 14 '21 at 08:37
  • @MichaelJ.Erickson `apply` enables you to run any function that takes a dataframe as input. Also, check my update for a solution using loops – mozway Sep 14 '21 at 08:41