0

I have two data frames df1 and df2. In df1 I have 50 columns and in df2 I have 50+ columns. What I want to achieve is that In df1 I have 13000 rows and a column name subject where names of all subjects are given. In df2 I have 250 rows and along 50+ I have two columns named subject code and subject_name.

        Here is an example of my datasets:

        df1 = 
        index     subjects
        0         Biology
        1         Physicss
        2         Chemistry
        3         Biology
        4         Physics
        5         Physics
        6         Biolgy

    df2 = 
        index     subject_name    subject_code
        0         Biology         BIO
        1         Physics         PHY
        2         Chemistry       CHE
        3         Medical         MED
        4         Programming     PRO
        5         Maths           MAT
        6         Literature      LIT 

My desired output in df1 (after replacing subject_name and fixing the spelling errors) is:
            index     subjects        subject_code
            0         Biology         BIO
            1         Physics         PHY
            2         Chemistry       CHE
            3         Biology         BIO
            4         Physics         PHY
            5         Physics         PHY
            6         Biology         BIO

What happens at my end is that I want to merge all subject values in df1 with values in df2 subject name values. In df1 there are around 500 rows where I get NAN after I merge both the columns in one as in these 500 rows there is some difference in the spellings of the subject. I have tried solution given at the following links but didn't work for me: replace df index values with values from a list but ignore empty strings

Python pandas: replace values multiple columns matching multiple columns from another dataframe

            Here is my code:

            df_merged = pd.merge(df1_subject,df2_subjectname, left_on='subjects', right_on='subject_name')
        df_merged.head()

Can anyone tell me how I can fix this issue as I have already spend 8 hours on this issue but am unable tor resolve it.

Cheers

Community
  • 1
  • 1
Baig
  • 469
  • 2
  • 7
  • 19
  • Did you inspect those rows in ```df1```, is there something different in the spelling? ...```df1``` column name is ```subjects``` but you used ```left_on = 'subject'``` in the merge argument. – wwii Sep 28 '16 at 16:41
  • That was a typo. I have fixed the spelling error and correct one is left_on = 'subjects' – Baig Sep 28 '16 at 16:45
  • Have you tried to merge based on the first three characters being the sane? – wwii Sep 30 '16 at 05:02
  • HI wwii, actually that will not work as in my dataframe I have few subjects like web design, web development, etc and if I match first three characters then it will take wrong values. – Baig Sep 30 '16 at 13:29
  • Have you characterized the misspellings? How many unique misspellings for each term are there? - something like df1.subjects not in df2.subject_names then a groupby. Maybe you can automate correcting the misspellings then do a merge. – wwii Oct 07 '16 at 00:18

2 Answers2

0

One of the issues you have is the incorrect spellings. You can try to harmonise the spelling of the subject across both dataframes using the difflib module and its get_close_matches method.

Using this code will return the closest matching subject for each match in df1 and and df2. df1's columns will be updated to reflect this. Therefore, even if the subject name is spelled incorrectly, it will now have the same spelling in both dataframes.

import pandas as pd
import difflib

df2['subject_name'] = df2.subject_name.map(lambda x: difflib.get_close_matches(x, df1.subject)[0])

After this you can try to merge. It may resolve your issue, but it would be easier to fix if you provide a reproducible example.

conor
  • 1,267
  • 10
  • 7
  • Hi Conor, If you don't mind can you please tell me what this x is inside get_close_matches(x,...) as I am getting error list index out of range – Baig Sep 28 '16 at 17:56
  • try this instead `df2.subject_name.map(lambda x: difflib.get_close_matches(x, df1.subject[0])` – conor Sep 28 '16 at 18:04
  • The `x` is simply is a place holder for the variable `subject_name` in df2. – conor Sep 28 '16 at 18:05
  • Sorry Conor with this one I am getting unexpected EOF while parsing error – Baig Sep 28 '16 at 18:08
  • That type of error relates to syntax, also there was a small mistake in my code. This line `df2.subject_name.map(lambda x: difflib.get_close_matches(x, df1.subject)[0])` works in a sample dataset that I set up. It is worth one more try :) – conor Sep 28 '16 at 18:18
0

Correct the spelling then merge...

import pandas as pd
import operator, collections

df1 = pd.DataFrame.from_items([("subjects",
                                ["Biology","Physicss","Phsicss","Chemistry",
                                 "Biology","Physics","Physics","Biolgy","navelgazing"])])
df2 = pd.DataFrame.from_items([("subject_name",
                                ["Biology","Physics","Chemistry","Medical",
                                 "Programming","Maths","Literature"]),
                               ("subject_code",
                                ["BIO","PHY","CHE","MED","PRO","MAT","LIT"])])

Find the misspellings:

misspelled = set(df1.subjects) - set(df2.subject_name)

Find the subject that matches the misspelling best and create a dictionary -> {mis_sp : subject_name}

difference = operator.itemgetter(1)
subject = operator.itemgetter(0)
def foo1(word, candidates):
    '''Returns the most likely match for a misspelled word
    '''
    temp = []
    for candidate in candidates:
        count1 = collections.Counter(word)
        count2 = collections.Counter(candidate)
        diff1 = count1 - count2
        diff2 = count2 - count1
        diff = sum(diff1.values())
        diff += sum(diff2.values())
        temp.append((candidate, diff))
    return subject(min(temp, key = difference))

def foo2(words):
    '''Yields (misspelled-word, corrected-word) tuples from misspelled words'''
    for word in words:
        name = foo1(word, df2.subject_name)
        if name:
            yield (word, name)

d = dict(foo2(misspelled))               

Correct all the misspellings in df1

def foo3(thing):
    return d.get(thing, thing)

df3 = df1.applymap(foo3)

Merge

df2 = df2.set_index("subject_name")
df3 = df3.merge(df2, left_on = "subjects", right_index = True, how = 'left')

foo1 might possibly be sufficient for this purpose, but there are better, more sophisticated, algorithms to correct spelling. maybe, http://norvig.com/spell-correct.html

Just read @conner's solution. I didn't know difflib was there so a better foo1 would be,

def foo1(word, candidates):
    try:
        return difflib.get_close_matches(word, candidates, 1)[0]
    except IndexError as e:
        # there isn't a close match
        return None
wwii
  • 23,232
  • 7
  • 37
  • 77