0

I have two dataframes containing information about student grades and test scores. The first looks like this:

     ID    Test_Score    Class1    Class2    Class3
0    001   85            B-        A         C+
1    002   78            B         NaN       B+
2    003   93            A         B         NaN
...

The second looks like this:

     0                1
0    Algebra          A
0    Calculus_1       B
0    Calculus_2       C-
1    Algebra          C+
1    Trig             F
1    Trig             C
1    Calculus_1       C-
...

The indices in each dataframe refer to the same student. So the student at index 0 is the same in both dataframes.

What I want to do is create a pivot table from the second dataframe, where the rows correspond to students (i.e. indices), the columns are the math classes, and the values are the highest grade they made in each class (since it's possible for a student to have taken a class more than once). I will then concatenate this to the first dataframe.

I have already written an aggregation function that finds the highest grade.

I tried the following knowing it would fail:

p = pd.pivot_table(u, columns=0, values=1, aggfunc=highest)

Instead of returning a dataframe with a row for each student, I get a single row that contains the highest grade that any student made in the given class:

     Algebra   Trig    Precalculus   Calculus_1   Calculus_2
1    A         A+      A+            A            A

The way I've framed the problem in my mind, I just need to aggregate across the rows which share in index, which I am unsure how to do.

I'm also open to an entirely different approach.

Desired output:

     ID    Test_Score    Class1    Class2    Class3   Algebra   Trig ...
0    001   85            B-        A         C+       A         NaN
1    002   78            B         NaN       B+       C+        C
2    003   93            A         B         NaN      B         B-
...

EDIT: Here is the code for my 'highest' function:

def highest(x):
    q = 0
    z = None
    for g in x:
        if qpoints(g) > q:
            q = qpoints(g)
            z = g
    return z

where qpoints is the following function that I was already using elsewhere:

def qpoints(x):
    qvalue = {'W': 0,
              'F': 0,
              'D': 1.0,
              'D+': 1.33,
              'C-': 1.67,
              'C': 2.0,
              'C+': 2.33,
              'B-': 2.67,
              'B': 3.0,
              'B+': 3.33,
              'A-': 3.84,
              'A': 4.0,
              'A+': 4.0}
    return qvalue[x]
asgarth
  • 77
  • 1
  • 6

2 Answers2

1

You problem is that you need to keep the index values of the second dataframe when you pivot_table, see this answer for understanding. So if you do:

print (df2.reset_index().pivot_table(index='index', values=[1], columns=[0], 
                                     aggfunc= lambda x: sorted(x)[0])) 
                                     # I used my own idea of highest function
            1                           
0     Algebra Calculus_1 Calculus_2 Trig
index                                   
0           A          B         C-  NaN
1          C+         C-        NaN    C

then you can join such as:

df_p = df2.reset_index().pivot_table(index='index', values=[1], columns=[0], 
                                                aggfunc= lambda x: sorted(x)[0])
df_p.columns = [col[1] for col in df_p.columns]
new_df = df1.join(df_p)

print (new_df)
    ID  Test_Score Class1 Class2 Class3 Algebra Calculus_1 Calculus_2 Trig
0  001          85     B-      A     C+       A          B         C-  NaN
1  002          78      B    NaN     B+      C+         C-        NaN    C
2  003          93      A      B    NaN     NaN        NaN        NaN  NaN
Ben.T
  • 29,160
  • 6
  • 32
  • 54
0

This is what you want. However with pivot you aren't allowed to have duplicate column names, so having Trig twice for the same user is going to result in an error.

So after renaming the duplicate Trig value to Trig2, the join/pivot works well.

df = pd.DataFrame({'ID':['001','002','003'],'Test_Score':[85,78,93],'Class1':['B-','B','A'],'Class2':['A','','B'],'Class3':['C+','B+','',]})

df2 = pd.DataFrame({0:['Algebra','Calculus_1','Calculus_2','Algebra','Trig','Trig2','Calculus_1'],1:['A','B','C-','C+','F','C','C-']}, index=[0,0,0,1,1,1,1])



df.join(df2.pivot(columns=0, values=1))
    ID  Test_Score Class1 Class2 Class3 Algebra Calculus_1 Calculus_2 Trig Trig2
0  001          85     B-      A     C+       A          B         C-  NaN   NaN
1  002          78      B            B+      C+         C-        NaN    F     C
2  003          93      A      B            NaN        NaN        NaN  NaN   NaN
Chris
  • 15,819
  • 3
  • 24
  • 37