I am processing information in several Pandas DataFrames with 10,000+ rows.
I have...
df1, student information
Class Number Student ID
0 13530159 201733468
1 13530159 201736271
2 13530159 201833263
3 13530159 201931506
4 13530159 201933329
...
df2, student responses
title time stu_id score
0 Unit 12 - Reading Homework 10/30/2020 22:06:53 202031164 100
1 Unit 10 - Vocabulary Homework 11/1/2020 21:07:44 202031674 100
2 Unit 10 - Vocabulary Homework 11/3/2020 17:20:55 202032311 100
3 Unit 12 - Reading Homework 11/6/2020 6:04:37 202031164 95
4 Unit 12 - Reading Homework 11/7/2020 5:49:15 202031164 90
...
I want...
a DataFrame with columns for the class number, student ID, and unique assignment titles. The assignment columns should contain the students' highest score for that assignment. There can be 20+ assignments / columns. A student can have many different scores for a single assignment. I only want the highest. I also want to omit scores submitted after a specific date.
df3, highest student grades
Class Number Student ID Unit 12 - Reading Homework Unit 10 - Vocabulary Homework ...
0 13530159 201733468 100 85 ...
1 13530159 201736271 95 70 ...
2 13530159 201833263 75 65 ...
3 13530159 201931506 80 85 ...
4 13530159 201933329 65 75 ...
...
What is the most efficient way? I will do this several dozen times.
PS, the DataFrames are based on 50+ Google Sheets. I could go back and compile a new DataFrame from the original sheets, but this is time consuming. I'm hoping there is an easier, faster way.
PPS, I've read similar questions: Pandas: efficient way to combine dataframes, Pandas apply a function of multiple columns, row-wise, Conditionally fill column values based on another columns value in pandas, etc. None of them specifically address my question.