
I have two dataframes, `mapp` and `data`, as shown below:

mapp = pd.DataFrame({'variable': ['d22','Studyid','noofsons','Level','d21'],'concept_id':[1,2,3,4,5]})

data = pd.DataFrame({'sourcevalue': ['d22heartabcd','Studyid','noofsons','Level','d21abcdef']})


I would like to fetch a value from `data` and check whether it is present in `mapp`; if yes, get the corresponding `concept_id` value. The priority is to look for an exact match first. If no exact match is found, fall back to a substring match. As I am dealing with more than a million records, any scalable solution is helpful.

s = mapp.set_index('variable')['concept_id']
data['concept_id'] = data['sourcevalue'].map(s) 

produces an output where the substring-only rows (`d22heartabcd`, `d21abcdef`) come back as `NaN`.

When I do the substring match, the records that previously matched exactly become `NaN`:

data['concept_id'] = data['sourcevalue'].str[:3].map(s)


I don't know why it's giving `NaN` for valid records now.

How can I do these two checks at once in an elegant and efficient manner?

I expect my output to have every row matched, i.e. `concept_id` values 1 through 5 for the sample above.

The Great
  • All the answers below were good. However, @Jezrael was able to help me understand the issue with my approach and also provided an elegant, easy solution to the question. Erfan's answer was also good to know. Nonetheless, upvoted all answers – The Great Aug 23 '19 at 09:21

3 Answers


If you need to map both by full strings and by the first 3 letters, create 2 separate Series, then use `Series.fillna` or `Series.combine_first` to replace the missing values in `a` with the values from `b`:

s = mapp.set_index('variable')['concept_id']
a = data['sourcevalue'].map(s) 
b = data['sourcevalue'].str[:3].map(s)

data['concept_id'] = a.fillna(b)
#alternative
#data['concept_id'] = a.combine_first(b)
print (data)
    sourcevalue  concept_id
0  d22heartabcd         1.0
1       Studyid         2.0
2      noofsons         3.0
3         Level         4.0
4     d21abcdef         5.0

EDIT:

#all strings map Series
s = mapp.set_index('variable')['concept_id']
print (s)
variable
d22         1
Studyid     2
noofsons    3
Level       4
d21         5
Name: concept_id, dtype: int64

#first 3 letters map Series
s1 = mapp.assign(variable = mapp['variable'].str[:3]).set_index('variable')['concept_id']
print (s1)
variable
d22    1
Stu    2
noo    3
Lev    4
d21    5
Name: concept_id, dtype: int64

#first 3 letters map by all strings
print (data['sourcevalue'].str[:3].map(s))
0    1.0
1    NaN
2    NaN
3    NaN
4    5.0
Name: sourcevalue, dtype: float64

#first 3 letters match by 3 first letters map Series
print (data['sourcevalue'].str[:3].map(s1))
0    1
1    2
2    3
3    4
4    5
Name: sourcevalue, dtype: int64
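Putting the pieces from the EDIT together: map by exact strings first, then fall back to the prefix-keyed Series `s1`, so rows like `d22heartabcd` still resolve. A sketch on the sample data; note it assumes the 3-letter prefixes in `mapp` are unique, otherwise `s1` would have duplicate index labels and `map` would fail:

```python
import pandas as pd

mapp = pd.DataFrame({'variable': ['d22', 'Studyid', 'noofsons', 'Level', 'd21'],
                     'concept_id': [1, 2, 3, 4, 5]})
data = pd.DataFrame({'sourcevalue': ['d22heartabcd', 'Studyid', 'noofsons', 'Level', 'd21abcdef']})

# exact-string map and 3-letter-prefix map
s = mapp.set_index('variable')['concept_id']
s1 = mapp.assign(variable=mapp['variable'].str[:3]).set_index('variable')['concept_id']

# exact match first, then fall back to the prefix map
data['concept_id'] = data['sourcevalue'].map(s).fillna(data['sourcevalue'].str[:3].map(s1))
print(data['concept_id'].tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Both lookups are plain `Series.map` calls, so this stays vectorized for large inputs.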
jezrael

Using the fuzzy_merge function I wrote:

new = fuzzy_merge(data, mapp, 'sourcevalue', 'variable')\
       .merge(mapp, left_on='matches', right_on='variable')\
       .drop(columns=['matches', 'variable'])

Output

    sourcevalue  concept_id
0  d22heartabcd           1
1       Studyid           2
2      noofsons           3
3         Level           4
4     d21abcdef           5

Function used from linked answer:

# requires the fuzzywuzzy (or thefuzz) package for process.extract
from fuzzywuzzy import process

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    df_1 is the left table to join
    df_2 is the right table to join
    key1 is the key column of the left table
    key2 is the key column of the right table
    threshold is how close the matches should be to return a match
    limit is the amount of matches will get returned, these are sorted high to low
    """
    s = df_2[key2].tolist()

    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))    
    df_1['matches'] = m

    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2

    return df_1
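If installing fuzzywuzzy is not an option, the matching step can be roughly approximated with the stdlib's `difflib.get_close_matches`; note this is a swapped-in technique, not Erfan's function, and the `cutoff=0.3` is an assumption tuned for these sample strings rather than the `threshold=90` above:

```python
import difflib

variables = ['d22', 'Studyid', 'noofsons', 'Level', 'd21']

# best fuzzy candidate for one source value; the cutoff is lowered because
# the source string is much longer than the lookup keys
best = difflib.get_close_matches('d22heartabcd', variables, n=1, cutoff=0.3)
print(best)  # ['d22']
```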
Erfan

Try this. Here we locate the `NaN` values left after the first map and do the substring mapping only on those rows:

s = mapp.set_index('variable')['concept_id']
data['concept_id'] = data['sourcevalue'].map(s)
data.loc[data['concept_id'].isnull(),'concept_id' ]= data['sourcevalue'].str[:3].map(s)
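An end-to-end run of this approach on the sample frames from the question (a sketch, assuming the same `mapp`/`data`):

```python
import pandas as pd

mapp = pd.DataFrame({'variable': ['d22', 'Studyid', 'noofsons', 'Level', 'd21'],
                     'concept_id': [1, 2, 3, 4, 5]})
data = pd.DataFrame({'sourcevalue': ['d22heartabcd', 'Studyid', 'noofsons', 'Level', 'd21abcdef']})

s = mapp.set_index('variable')['concept_id']

# pass 1: exact match
data['concept_id'] = data['sourcevalue'].map(s)

# pass 2: first-3-letters match only on the rows that missed
miss = data['concept_id'].isnull()
data.loc[miss, 'concept_id'] = data.loc[miss, 'sourcevalue'].str[:3].map(s)

print(data['concept_id'].tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```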
moys
  • But why does `:3` return `NA` for the other records? When they satisfy an `exact match`, it should not return `NA` in this scenario. Am I right? Upvoted the answer – The Great Aug 23 '19 at 09:09
  • I mean when I just executed this, `data['concept_id'] = data['sourcevalue'].str[:3].map(s)`, I was expecting every record to match and get a corresponding id – The Great Aug 23 '19 at 09:11