1

I have 2 panda data frames

df1=

        student              courses
    0  student1   Math, Science, Gym
    1  student2  gym, Geography, Art
    2  student3            Chemistry
    3  student4   Art, math, physics
    4  student5     biology, history
    5  student6          Robotics

df2=

                     class  weight
    0                 math    9.00
    1              science    8.00
    2                  gym    4.00
    3            geography    6.00
    4                  art    7.00
    5            chemistry    8.50
    6              physics   10.00
    7              biology    9.00
    8              history    7.50
    9           Philosophy    9.00
   10   Physical Chemistry   10.00
   11     Computer Science    8.75

I would like to search the the 'courses' column in df1 using the possible values in df2 'class' column and then return the appropriate summed up weights.

Here's what I have so far

#extract courses
class_list = df2['class'].values.tolist()
class_list_regex= ['\\b%s\\b' % x for x in class_list]
class_list_pattern = '|'.join(class_list_regex)

final_df = df1['courses'].str.findall('('+class_list_pattern+ ')',flags=re.IGNORECASE)

resulting in

0     [Math, Science, Gym]
1    [gym, Geography, Art]
2              [Chemistry]
3     [Art, math, physics]
4       [biology, history]
Name: courses, dtype: object

how do I get something like this

    student        matched_courses      total weight
    0  student1   [Math, Science, Gym]          21.0
    1  student2  [gym, Geography, Art]          17.0
    2  student3            [Chemistry]           8.5
    3  student4   [Art, math, physics]          26.0
    4  student5     [biology, history]          16.5
    5  student6                     []           0.0

any help would be appreciated. Thanks

1 Answers1

0

Try this solution

import re
from sklearn.preprocessing import MultiLabelBinarizer

df1["courses"] = df1["courses"].apply(lambda x: re.sub('\s+', '', str(x).lower()).split(","))
mlb = MultiLabelBinarizer()
course_dummies = pd.DataFrame(mlb.fit_transform(df1["courses"]),columns=mlb.classes_, index=df1.index)
course_dummies_t=course_dummies.T.reset_index()
course_dummies_t=course_dummies_t.rename(columns={"index":"class"})
merge = pd.merge(course_dummies_t,df2,on="class",how="left")

df1["total weight"]=merge[range(0,6)].multiply(merge["weight"], axis="index").sum(axis=0)
Rajith Thennakoon
  • 3,975
  • 2
  • 14
  • 24
  • Yes, Thank you! it worked...just had to change the range to match the data frame shape. On a side note...if my df1['courses'] are not separated by a comma, how would I approach this? – fred macklin Jun 24 '20 at 02:20
  • Just tested it out on a data-frame with over 800,000 students and 75 classes...I wasn't able to run the script without getting a "exit code 137 (interrupted by signal 9: SIGKILL)" on pycharm. – fred macklin Jun 24 '20 at 02:36
  • what is the pandas or python error.this looks like not an issue of code – Rajith Thennakoon Jun 24 '20 at 03:14
  • Its looks like memory issue https://stackoverflow.com/questions/43268156/process-finished-with-exit-code-137-in-pycharm – Rajith Thennakoon Jun 24 '20 at 03:15
  • The code works perfectly!...for whatever reason it consumes a lot of memory when running the script on a larger dataset to the point where my IDE(pycharm kills the process) – fred macklin Jun 24 '20 at 03:20
  • lambda function take some memory because of it has to iterate over every row.for your first question `if my df1['courses'] are not separated by a comma` ,if list is separated from another character like `|`,still you can split using it. – Rajith Thennakoon Jun 24 '20 at 03:26