0

I have two data frames

df1

   model_detail                                        series_detail
Nissan Sentra 430 Sedan -1123                            Nissan Sentra 
Acura ILX150 Base Sedan -17652                          Acura ILX Sedan
Acura TLX300 Base FWD -89789                            Acura TLX Sedan
Acura MDX450 Advanced Electric Hybrid AWD -55647        Acura MDX Sedan

df_2

  Name           model                      Series      Amount         
  Acura       MDX 450 Hybrid                 MDX        45900
  Acura       ILX 150 Sedan                  ILX        156700
  Nissan      Sentra 430 Electric           Sentra      88897
  Nissan      Sunny 150                     Sunny       12333
  Acura       MDX 450 Electric               MDX        90000

I need to merge df1 and df2 to get the amount column in the df1.

Converting the column values similar to df2 is a very big process because I have 40 car brands and all are in a different format. It is impossible to write one code for all the brands as there would be many conditions.

Is there any other way where I can join these two data frames? Does fuzzy match help?

Or is there any other alternative? Because the Amount column is very sensitive it is important to have accuracy.

Can any one help me with a code to merge these type of data frames? Thanks a lot in advance.

What I have been doing brand-wise:

def model_extractfunc(row):
    blocks = row['model_detail'].split()
    mid_index = 1
    if ('Hybrid' in blocks) or ('Electric' in blocks) or('Connect' in blocks):
        mid_index = 3
    return ' '.join(blocks[1:mid_index+1])

df1['extract_model'] = df1.apply(model_extractfunc, axis=1)

Again after extract_model column is generated - I have been checking the strings and replacing them according to the df2

repeating the same process for series_detail column to get Series column for df2

  • My approach is very time-taking and in-efficient as some brands have too many models and series details that I have to extract and if-else loops - would be inefficient in that cases.

  • Also, there is a possibility of Human errors as I have to manually check if all models are getting covered and replace names accordingly

ar_mm18
  • 415
  • 2
  • 8
  • 1
    Does this answer your question? [is it possible to do fuzzy match merge with python pandas?](https://stackoverflow.com/questions/13636848/is-it-possible-to-do-fuzzy-match-merge-with-python-pandas) – Michael Delgado Sep 23 '22 at 17:44
  • No @MichaelDelgado. I tried doing that but the preprocessing of a column to to fuzzy match is taking a lot of time – ar_mm18 Sep 23 '22 at 17:46
  • well - that doesn't mean it's the wrong approach. you just have a large dataframe. so... wait longer? if you have specific requirements around performance/size/etc then list them. as your question is currently written, it's an exact duplicate. – Michael Delgado Sep 23 '22 at 17:49
  • 1
    also... 40 isn't a large number. if you need to guarantee a precise match a manual mapping isn't a bad option. just put all possible brands in a dictionary and standardize them with [`pd.Series.replace`](https://pandas.pydata.org/docs/reference/api/pandas.Series.replace.html). Any automated solution which merges strings with name variations will be subject to uncertainty and/or error. putting that accuracy is important in bold text doesn't make that problem go away ;) if this is going to be a standalone (non-duplicate) question you need to show us what you've tried - see [ask] – Michael Delgado Sep 23 '22 at 17:56
  • I have updated the question with what I have been doing brandwise. I hope my question is clear now. @MichaelDelgado – ar_mm18 Sep 23 '22 at 18:13
  • Do you think my approach is the right way for accuracy? Or any other suggestions would be really helpful. There are so many clauses for each brand so I had to do it brand-wise rather than one single code. @MichaelDelgado – ar_mm18 Sep 23 '22 at 18:15
  • 1
    this isn't a question that is answerable on SO. this is just a data cleaning exercise. I'd recommend finding the list of unique values using `df[columnname].unique()`, create a dictionary mapping all options to what you want, and use series.replace(). or, use fuzzywuzzy. up to you. I would not recommend a giant set of if-else blocks. but this is a style question and not [on-topic](/help/on-topic) here. – Michael Delgado Sep 23 '22 at 18:17

0 Answers0