I been seeing this, this, and this threads, but still cannot understand how to make the following problem more efficient:
I have a DataFrame with Course Names and the university who offers it:
df_courses
:
course name | university | |
---|---|---|
0 | name of course one | university one |
1 | name of course two | university one |
2 | "name of course three, with comma" | university two |
And I have another DataFrame that contains students enrollments:
df_enrollments
:
enrollments | student email | |
---|---|---|
0 | name of course one, name of course two | student1@faculty.com |
1 | name of course two, name of course three | student2@faculty2.com |
2 | name of course three, with comma, name of course one, name of course two | student3@faculty.com |
What I want to do is to get a new dataframe with each enrollment by student:
df_all_enrollments
:
course name | student email | |
---|---|---|
0 | name of course one | student1@faculty.com |
1 | name of course two | student1@faculty.com |
2 | name of course two | student2@faculty2.com |
3 | "name of course three, with comma" | student2@faculty2.com |
4 | "name of course three, with comma" | student3@faculty.com |
5 | name of course one | student3@faculty.com |
6 | name of course two | student3@faculty.com |
The main issue is with the courses names with commas.
What I'm doing now to get this result is to make a list of df_courses['course name']
and then iterate over df_enrollments['enrollments']
searching contains and adding a new column with the course name:
courses = df_courses['course name'].to_list()
df_all_enrollments = pd.DataFrame()
for i in courses:
df_all_enroll = df_enrollments.loc[df_enrollments['enrollments'].str.contains(i, na=False, regex=False, case=True)]
df_all_enroll.insert(1, 'Course Name', i)
df_all_enrollments = pd.concat([df_all_enrollments, df_all_enroll ])
Until now this approach has worked, but I'm wondering if there's a more efficient way to perform this task.
Any help will be greatly appreciated.