How to combine multiple datasets (append or merge) into one in Python based on an identifying variable?

Question

In Python, I am working with longitudinal school data, and have 6 subsets of data, each with the same 4 years of school data (4 files) and the same students for the most part. Each subset represents something different such as standardized test scores, attendance data, etc.

What I want to do is merge them into 1 big file where each student ID is preferably stacked by year and has columns from all of the subsets. For example, let's say a students ID number is 123456, I would want the big data set to look like:

Student ID Year Test Score Days Absent...

123456 2016 97 10

123456 2017 91 14

123456 2018 94 16

Let's say one of the subsets is called "test scores", and in that are 4 files titled 2016, 2017, 2018, and 2019. How would I merge those 4 files together so that they are stacked based on the student id number for each school year like how it is above?

And after I merge the files of that subset, let's say there's another subset called "achievement" which is measure of teacher evaluation on students. One of those variables is the same student id, and another variable is called grade level. How would I go about then merging in the grade level column based on student ID number into the merged test scores file so that the students in the test scores merged dataset now have a grade level associated with them?

Thanks!

Does this answer your question? [Pandas Merging 101](https://stackoverflow.com/questions/53645882/pandas-merging-101) — Marcelo Paco, Apr 13 '23 at 23:53

score 0 · Answer 1 · answered Apr 14 '23 at 06:26

IIUC, you need a combo concat/merge :

import pandas as pd
    
scores = (pd.concat([pd.read_csv(f"test_scores_{year}.csv") # make sure the adjust the years
                     for year in range(2016, 2020)]).sort_values(["Student ID", "Year"]))

achv = pd.read_csv("achievement.csv")

out = scores.merge(achv[["Student ID", "Grade Level"]],
                   on="Student ID", how="left") # make sure to adjust the column names

out.to_csv("TestScore_&_Achievements.csv", index=False)

Output :

    Student ID  Year  Test Score  Days Absent  Grade Level
0       123456  2016          97           10            8
1       123456  2017          91           14            8
2       123456  2018          94           16            8
..         ...   ...         ...          ...          ...
17      678901  2017          95            3            8
18      678901  2018          97            1            8
19      678901  2019          95            2            8

[20 rows x 5 columns]

How to combine multiple datasets (append or merge) into one in Python based on an identifying variable?

1 Answers1