How to get substring more efficiently in pandas

Question

I been seeing this, this, and this threads, but still cannot understand how to make the following problem more efficient:

I have a DataFrame with Course Names and the university who offers it:

df_courses:

	course name	university
0	name of course one	university one
1	name of course two	university one
2	"name of course three, with comma"	university two

And I have another DataFrame that contains students enrollments:

df_enrollments:

	enrollments	student email
0	name of course one, name of course two	student1@faculty.com
1	name of course two, name of course three	student2@faculty2.com
2	name of course three, with comma, name of course one, name of course two	student3@faculty.com

What I want to do is to get a new dataframe with each enrollment by student:

df_all_enrollments:

	course name	student email
0	name of course one	student1@faculty.com
1	name of course two	student1@faculty.com
2	name of course two	student2@faculty2.com
3	"name of course three, with comma"	student2@faculty2.com
4	"name of course three, with comma"	student3@faculty.com
5	name of course one	student3@faculty.com
6	name of course two	student3@faculty.com

The main issue is with the courses names with commas.

What I'm doing now to get this result is to make a list of df_courses['course name'] and then iterate over df_enrollments['enrollments'] searching contains and adding a new column with the course name:

courses = df_courses['course name'].to_list()   
df_all_enrollments = pd.DataFrame()
for i in courses:
    df_all_enroll = df_enrollments.loc[df_enrollments['enrollments'].str.contains(i, na=False, regex=False, case=True)]
    df_all_enroll.insert(1, 'Course Name', i)
    df_all_enrollments = pd.concat([df_all_enrollments, df_all_enroll ])

Until now this approach has worked, but I'm wondering if there's a more efficient way to perform this task.

Any help will be greatly appreciated.

sorry Timus, I just edited the question because I forgot something important, but for some reason the editor gives a "code is not well formatted" error. The only way to get past that flag was to transform the tables into code :( — JairoM, Mar 15 '22 at 14:32
Ok, I see. I've just fixed the last table and hope the post is okay. — Timus, Mar 15 '22 at 14:56

score 0 · Answer 1 · answered Mar 14 '22 at 20:08

Is it what you expect:

courses = df2['enrollments'].str.split(', ')

df_all_enrollments = df_enrollments.assign(**{'course name': courses}) \
                                   .explode('course name', ignore_index=True)
print(df_all_enrollments)

Output

>>> df_all_enrollments
                                         enrollments          student email           course name
0             name of course one, name of course two   student1@faculty.com    name of course one
1             name of course one, name of course two   student1@faculty.com    name of course two
2           name of course two, name of course three  student2@faculty2.com    name of course two
3           name of course two, name of course three  student2@faculty2.com  name of course three
4  name of course three, name of course one, name...   student3@faculty.com  name of course three
5  name of course three, name of course one, name...   student3@faculty.com    name of course one
6  name of course three, name of course one, name...   student3@faculty.com    name of course two

Your answer is absolutely correct, but I ran into a problem that I forgot to clarify in my original question. Some of the courses have commas in their name (such as "course name four, with comma"). Sorry about that. — JairoM, Mar 15 '22 at 01:05

JairoM · Accepted Answer · 2022-03-15T17:30:29.003

Well, I have a possible answer.

The original approach took 6.6 minutes to process 41541 rows and output a DataFrame and CSV with 240603 rows.

Corralien's proposed answer is absolutely correct, but my data has some names with commas (eg 'Introduction to TensorFlow for AI, machine learning and deep learning')

The first problem (checking for matched course names in df_enrollments) was solved with a list comprehension:

df_enrollments['Course Name'] = [[y for y in cursos if y in x] for x in df_enrollments['Enrollments']]

The result, which was a column with a list of courses, is exploded up and gives me the expected result:

df_enrollments = df_enrollments.explode('Course Name')

Now it only takes 98.73 seconds to do it all :D

How to get substring more efficiently in pandas

2 Answers2