merge two datasets based on a string match

Question

I have two datasets;

DF1 contains a column that has a short word.

DF2 has a column with a series of words, where in some cases they start with the short word from DF1

I would like to create a new column in DF2, and everytime the DF1 word appears at the beginning of DF2's word, to then place that DF1 word into the newly created column at that location.

The idea being that once completed, I can then merge the two datasets using the matching word.

How would I do this?

DF1

ref
ABC
DEF
GHI

DF2

word
ABC123
DEF456
GHI789

DF2 - Desired output

word	new column
ABC123	ABC
DEF456	DEF
GHI789	GHI

Corralien · Accepted Answer · 2022-03-31T14:07:52.143

1

You can create a list of pattern from ref column then try to extract it from word column:

import re

pattern = fr"({'|'.join(re.escape(x) for x in df1['ref'])})"
df2['new column'] = df2['word'].str.extract(pattern)
print(df2)

# Output
     word new column
0  ABC123        ABC
1  DEF456        DEF
2  GHI789        GHI

If it's too simple, you have to use fuzzy logic. You can check this Question & Answer

edited Mar 31 '22 at 14:07

answered Mar 30 '22 at 14:54

Corralien

109,409
8
28
52

When trying on a larger data set, I'm getting a key error message on that 'new column' created, followed by another exception error message; 'ValueError: Expected a 1D array, got an array with shape' Any ideas? – Jr Tee Mar 31 '22 at 13:07
Is it possible for you to share your data or a sample that raises this exception? – Corralien Mar 31 '22 at 13:11
Ok. Can you update your post with a new row that can failed, please? – Corralien Mar 31 '22 at 13:56
Managed to work it out. There were parenthesis in my pattern data, which was telling Regex to create new capture groups. Thanks :) – Jr Tee Mar 31 '22 at 14:05
I updated my answer. Use `re.escape` to prevent this case. It should work now. – Corralien Mar 31 '22 at 14:08
Sorry, and how would I go about doing the same match but where 'ref' appears at the start of 'word'? Imagine it would be using something along the lines of str.startswith? – Jr Tee Mar 31 '22 at 14:08
If you want to match the start of the word, use: `pattern = fr"^({'|'.join(re.escape(x) for x in df1['ref'])})"` Note the *`^`* – Corralien Mar 31 '22 at 14:10

merge two datasets based on a string match

1 Answers1