0

I'm doing something that I know that I shouldn't be doing. I'm doing a for loop within a for loop (it sounds even more horrible, as I write it down.) Basically, what I want to do, theoretically, using two dataframes is something like this:

for index, row in df_2.iterrows():
    for index_1, row_1 in df_1.iterrows():
        if row['column_1'] == row_1['column_1'] and row['column_2'] == row_1['column_2'] and row['column_3'] == row_1['column_2']:
            row['column_4'] = row_1['column_4']

There has got to be a (better) way to do something like this. Please help!

kfp_ny
  • 39
  • 1
  • 2
  • 7
  • 2
    You can do a inner join on those columns, then you get a new df with all rows you want. Iterate over than if you like. – greedy52 Feb 28 '17 at 20:06
  • Another complication is that I may need to do some fuzzy string matching (the columns have typos). In that case, inner join is not an option for me. – kfp_ny Feb 28 '17 at 20:41

1 Answers1

1

As pointed out by @Andy Hayden in is it possible to do fuzzy match merge with python pandas?, you can use difflib : get_closest_matches function to create new join columns.

import difflib
df_2['fuzzy_column_1'] = df_2['column_1'].apply(lambda x: difflib.get_close_matches(x, df_1['column_1'])[0])
# Do same for all other columns

Now you can apply inner join using pandas merge function.

result_df = df_1.merge(df_2,left_on=['column_1', 'column_2','column_3'], and right_on=['fuzzy_column_1','fuzzy_column_2','fuzzy_column_3] )

You can use drop function to remove unwanted columns.

Community
  • 1
  • 1
Spandan Brahmbhatt
  • 3,774
  • 6
  • 24
  • 36
  • Thanks! I was thinking about using `fuzzwuzz` module, but this seems better. – kfp_ny Mar 03 '17 at 01:00
  • Sorry for the late response. I think it is working, but `get_closest_matches` forces a choice, and sometimes that isn't appropriate in my case. So I'm back to using `fuzzywuzzy`. – kfp_ny Mar 14 '17 at 18:40