
I have two dataframes, `mapp` and `data`, as shown below:

mapp = pd.DataFrame({'variable': ['d22','Studyid','noofsons','Level','d21'],'concept_id':[1,2,3,4,5]})

data = pd.DataFrame({'sourcevalue': ['d22heartabcd','Studyid','noofsons','Level','d21abcdef']})


I would like to fetch a value from `data` and check whether it is present in `mapp`; if yes, get the corresponding `concept_id` value. The priority is to look for an exact match first. If no exact match is found, fall back to a substring match. As I am dealing with more than a million records, any scalable solution is helpful.

s = mapp.set_index('variable')['concept_id']
data['concept_id'] = data['sourcevalue'].map(s) 

produces an output where the substring-only rows (`d22heartabcd`, `d21abcdef`) come back as `NaN`.

When I do the substring match, the records that previously matched exactly become `NaN`:

data['concept_id'] = data['sourcevalue'].str[:3].map(s)


I don't know why it's giving `NaN` for valid records now.

How can I do these two checks at once in an elegant and efficient manner?

I expect my output to have every row matched, i.e. `concept_id` values 1 through 5 for the sample above.

The Great
  • All the answers below were good. However, @Jezrael was able to help me understand the issue with my approach and also provided an elegant, easy solution to the question. Erfan's answer was also good to know. Nonetheless, upvoted all answers – The Great Aug 23 '19 at 09:21

3 Answers


If you need to map both by full strings and by the first 3 letters, create 2 separate Series, then use `Series.fillna` or `Series.combine_first` to replace the missing values in `a` with the values from `b`:

s = mapp.set_index('variable')['concept_id']
a = data['sourcevalue'].map(s) 
b = data['sourcevalue'].str[:3].map(s)

data['concept_id'] = a.fillna(b)
#alternative
#data['concept_id'] = a.combine_first(b)
print (data)
    sourcevalue  concept_id
0  d22heartabcd         1.0
1       Studyid         2.0
2      noofsons         3.0
3         Level         4.0
4     d21abcdef         5.0

EDIT:

#all strings map Series
s = mapp.set_index('variable')['concept_id']
print (s)
variable
d22         1
Studyid     2
noofsons    3
Level       4
d21         5
Name: concept_id, dtype: int64

#first 3 letters map Series
s1 = mapp.assign(variable = mapp['variable'].str[:3]).set_index('variable')['concept_id']
print (s1)
variable
d22    1
Stu    2
noo    3
Lev    4
d21    5
Name: concept_id, dtype: int64

#first 3 letters map by all strings
print (data['sourcevalue'].str[:3].map(s))
0    1.0
1    NaN
2    NaN
3    NaN
4    5.0
Name: sourcevalue, dtype: float64

#first 3 letters match by 3 first letters map Series
print (data['sourcevalue'].str[:3].map(s1))
0    1
1    2
2    3
3    4
4    5
Name: sourcevalue, dtype: int64
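Putting the pieces from the EDIT together: map by exact strings first, then fall back to the prefix-keyed Series `s1`, so rows like `d22heartabcd` still resolve. A sketch on the sample data; note it assumes the 3-letter prefixes in `mapp` are unique, otherwise `s1` would have duplicate index labels and `map` would fail:

```python
import pandas as pd

mapp = pd.DataFrame({'variable': ['d22', 'Studyid', 'noofsons', 'Level', 'd21'],
                     'concept_id': [1, 2, 3, 4, 5]})
data = pd.DataFrame({'sourcevalue': ['d22heartabcd', 'Studyid', 'noofsons', 'Level', 'd21abcdef']})

# exact-string map and 3-letter-prefix map
s = mapp.set_index('variable')['concept_id']
s1 = mapp.assign(variable=mapp['variable'].str[:3]).set_index('variable')['concept_id']

# exact match first, then fall back to the prefix map
data['concept_id'] = data['sourcevalue'].map(s).fillna(data['sourcevalue'].str[:3].map(s1))
print(data['concept_id'].tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Both lookups are plain `Series.map` calls, so this stays vectorized for large inputs.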
jezrael

Using the fuzzy_merge function I wrote:

new = fuzzy_merge(data, mapp, 'sourcevalue', 'variable')\
       .merge(mapp, left_on='matches', right_on='variable')\
       .drop(columns=['matches', 'variable'])

Output

    sourcevalue  concept_id
0  d22heartabcd           1
1       Studyid           2
2      noofsons           3
3         Level           4
4     d21abcdef           5

Function used from linked answer:

# requires the fuzzywuzzy (or thefuzz) package for process.extract
from fuzzywuzzy import process

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    df_1 is the left table to join
    df_2 is the right table to join
    key1 is the key column of the left table
    key2 is the key column of the right table
    threshold is how close the matches should be to return a match
    limit is the amount of matches will get returned, these are sorted high to low
    """
    s = df_2[key2].tolist()

    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))    
    df_1['matches'] = m

    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2

    return df_1
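If installing fuzzywuzzy is not an option, the matching step can be roughly approximated with the stdlib's `difflib.get_close_matches`; note this is a swapped-in technique, not Erfan's function, and the `cutoff=0.3` is an assumption tuned for these sample strings rather than the `threshold=90` above:

```python
import difflib

variables = ['d22', 'Studyid', 'noofsons', 'Level', 'd21']

# best fuzzy candidate for one source value; the cutoff is lowered because
# the source string is much longer than the lookup keys
best = difflib.get_close_matches('d22heartabcd', variables, n=1, cutoff=0.3)
print(best)  # ['d22']
```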
Erfan

Try this. Here we locate the `NaN` values left after the first map and do the substring mapping only on those rows:

s = mapp.set_index('variable')['concept_id']
data['concept_id'] = data['sourcevalue'].map(s)
data.loc[data['concept_id'].isnull(),'concept_id' ]= data['sourcevalue'].str[:3].map(s)
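An end-to-end run of this approach on the sample frames from the question (a sketch, assuming the same `mapp`/`data`):

```python
import pandas as pd

mapp = pd.DataFrame({'variable': ['d22', 'Studyid', 'noofsons', 'Level', 'd21'],
                     'concept_id': [1, 2, 3, 4, 5]})
data = pd.DataFrame({'sourcevalue': ['d22heartabcd', 'Studyid', 'noofsons', 'Level', 'd21abcdef']})

s = mapp.set_index('variable')['concept_id']

# pass 1: exact match
data['concept_id'] = data['sourcevalue'].map(s)

# pass 2: first-3-letters match only on the rows that missed
miss = data['concept_id'].isnull()
data.loc[miss, 'concept_id'] = data.loc[miss, 'sourcevalue'].str[:3].map(s)

print(data['concept_id'].tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```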
moys
  • But why does `:3` return `NA` for the other records? When they satisfy an `exact match`, it should not return `NA` in this scenario. Am I right? Upvoted the answer – The Great Aug 23 '19 at 09:09
  • I mean when I just executed this, `data['concept_id'] = data['sourcevalue'].str[:3].map(s)`, I was expecting every record to match and get a corresponding id – The Great Aug 23 '19 at 09:11