Replace all instances of a string on one dataframe that are found in another dataframe

Question

I have a dataframe that looks like this:

and I have another dataframe that looks like this

What I want to do is see if there is a match in the first dataframe resolution column with the column resolution_sort and if there is a match I want to replace all instances of that string with the string inside match_sort.

I know a brute force sort of way of doing this but I wanted to see if someone had a more pythonic way of doing this instead of using two for loops to do this.

I want to note that in the second dataframe I dropped any duplicates in the resolution sort column so we all unique values there. Here is some code that is not efficient that I think will work:

for x in first_df:
    for y in second_df:
        if x.loc['resolution'] == y.loc['resolution_sort']:
            first_df.replace({x.loc['resolution']: y.loc['match_sort']}, regex=True)

Good presentation of your objective. But please also provide the data as **code** that people can run and generate a test case. This helps people to answer quickly and also helps you or anyone else to compare the relevance and quality of the answers on a common set of dummy data. Also please mention any assumptions: _for the second dataframe, every entry in the column **resolution_sort** is unique_ and you could have case-mismatch while comparing the two columns. So that should be taken into account as well. — CypherX, Nov 04 '20 at 20:02
Please provide the data as code: not just as images (they are helpful to understand what you need, but not useful to generate the two dataframes). — CypherX, Nov 04 '20 at 20:12
@CypherX Do you know the best way to provide that data as code is? — Snorrlaxxx, Nov 04 '20 at 20:18
If I were you, I would just create a dataframe with some sample data and share the code here. For example, see the dummy data in my answer. You can directly run the dummy data and then run the solution on it for quick verification. — CypherX, Nov 04 '20 at 20:51

score 1 · Accepted Answer · answered Nov 04 '20 at 20:40

As mentioned by the commenters, it's hard if there's no sample data to test on, but my solution would be to join the dataframes by using .merge() and then combine columns match_sort and resolution.

third_df = pd.merge(
    first_df, 
    second_df[['resolution_sort', 'match_sort']], 
    how='left',
    left_on='resolution',
    right_on='resolution_sort',
)

third_df['combined'] = third_df['match_sort'].combine_first(third_df['resolution'])

score 1 · Answer 2 · answered Nov 04 '20 at 20:48

Solution

You can try this. I am using pandas.Series.replace() to only replace the matched occurences and leave the rest unchanged. If instead, you use pandas.Series.map() that will enforce non-matches to become NAN values.

## Assumptions
# dfa: 1st dataframe
# dfb: 2nd dataframe

# Option-1: single line solution
(dfa.resolution.str.lower().replace(dict(dfb[['resolution_sort', 'match_sort']].to_records(index=False))))

# Option-2: broken into multi-line for more clarity
(dfa.resolution
    # convert to lower-case before comparison
    .str.lower() 
    # match and replace
    .replace(dict(
        # create a dict for "match-and-replace"
        dfb[['resolution_sort', 'match_sort']].to_records(index=False) 
        )) 
)

Output:

0     replaced_keyboard
1        mouse_replaced
2        replaced_mouse
3     replaced_keyboard
4     replaced_keyboard
5        replaced_mouse
6     replaced_keyboard
7     replaced_keyboard
8        replaced_mouse
9        mouse_replaced
10       mouse_replaced
11          only in dfa
Name: resolution, dtype: object

Dummy Data

# dfa: 1st dataframe (to update resolution column here)

resolution = [
    'replaced keyboard', 
    'mouse replaced', 
    'replaced mouse',
    'Replaced Keyboard',
    'replaced keyboard', 
    'replaced mouse', 
    'Replaced Keyboard', 
    'replaced Keyboard', 
    'replaced Mouse', 
    'Mouse replaced',
    'mouse replaced' ,
    'only in dfa',
]

# dfb: 2nd dataframe (for lookup)

resolution_sort = [
    'replaced keyboard', 
    'replaced mouse', 
    'mouse replaced', 
    'only in dfb',
]

match_sort = [
    'replaced_keyboard', 
    'replaced_mouse', 
    'mouse_replaced', 
    'only_in_dfb', 
]

# Create dataframes
dfa = pd.DataFrame(dict(resolution=resolution))
dfb = pd.DataFrame(
    dict(
        resolution_sort = resolution_sort, 
        match_sort = match_sort,
    )
)

Output:

# print(dfa)

           resolution
0   replaced keyboard
1      mouse replaced
2      replaced mouse
3   Replaced Keyboard
4   replaced keyboard
5      replaced mouse
6   Replaced Keyboard
7   replaced Keyboard
8      replaced Mouse
9      Mouse replaced
10     mouse replaced
11        only in dfa

# print(dfb)

     resolution_sort         match_sort
0  replaced keyboard  replaced_keyboard
1     replaced mouse     replaced_mouse
2     mouse replaced     mouse_replaced
3        only in dfb        only_in_dfb

Replace all instances of a string on one dataframe that are found in another dataframe

2 Answers2

Solution

Dummy Data