
I'm starting to use pandas and I came across a problem that I don't know how to solve.

I have two dataframes.

The first contains car information, including the car model (column DESCR_MARCA_VEICULO):

df1
col1   col2   DESCR_MARCA_VEICULO
....   ....   'GM/CELTA 5 PORTAS SUPER'
....   ....   'VW/VOYAGE LS'
....   ....   'VW/GOL LS'
....   ....   'I/AUDI A4 2.0T FSI'
....   ....   'FIAT/UNO CS IE'

The second is a two-column lookup table (a "de-para", i.e. from-to mapping) containing the car model and a unique ID associated with that model, like this:

df2
ID     DESCR_MARCA_VEICULO
1      'GM - CELTA 5'
2      'VW - VOYAGE LS'
3      'VW - GOL LS'
4      'ACURA - INTEGRA GS 1.8'
5      'AUDI - 80 S2 AVANT'

And the differences don't necessarily follow a pattern, such as "/" always being replaced with " - ".

However, I have more than 5,000 distinct car models in df1 (which makes it impossible to go case by case), and I need to combine df1 and df2, bringing the ID column into df1 (i.e. a merge). But when I merge the dataframes, there are no matches because of these differences in the strings.

Is there any way I can merge these dfs by the similarity between the strings in the DESCR_MARCA_VEICULO column?

Thank you :)

  • Just a guess, but the second column has just the brand name and the model (GM - CELTA). You could extract those parts with str.split (or str.extract) and a regex for the pattern (see the sketch after these comments). If you could provide an example with a mix of different delimiters, I'm sure folks here can work something out. – sammywemmy Jan 25 '20 at 02:00
  • I added a few lines to the examples. However, it does not necessarily follow a logic: both columns have the brand and the base model, like VW GOL or AUDI A3, but sometimes they also append details such as how many valves or cylinders the car has. – Danilo Perl Jan 25 '20 at 02:15
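
A minimal sketch of the split sammywemmy suggests (not from the thread; the BRAND/MODEL column names are illustrative, and it assumes the brand is whatever comes before the first "/" or "-" delimiter; note that rows like 'I/AUDI A4 2.0T FSI', where the prefix is an import marker rather than the brand, would need special handling):

import pandas as pd

# str.extract with two capture groups returns a DataFrame:
# group 1 = text before the first "/" or "-", group 2 = the rest.
pattern = r'^\s*([A-Za-z]+)\s*[/-]\s*(.*)$'
df1[['BRAND', 'MODEL']] = df1['DESCR_MARCA_VEICULO'].str.extract(pattern)
df2[['BRAND', 'MODEL']] = df2['DESCR_MARCA_VEICULO'].str.extract(pattern)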

1 Answer


I recommend you look into fuzzy matching with the fuzzywuzzy package.
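
For a quick feel of what fuzzywuzzy does (a toy illustration, not part of the solution below): fuzz.ratio scores the similarity of two strings from 0 to 100 based on Levenshtein distance, and process.extract ranks a list of candidates by that kind of score.

from fuzzywuzzy import fuzz, process

# Similarity score between 0 and 100.
print(fuzz.ratio('GM/CELTA 5 PORTAS SUPER', 'GM - CELTA 5'))

# The two best-scoring candidates for one query string.
candidates = ['GM - CELTA 5', 'VW - VOYAGE LS', 'VW - GOL LS']
print(process.extract('VW/GOL LS', candidates, limit=2))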

One hack is to strip all characters except letters and digits from your strings first:

df1['var'] = df1['DESCR_MARCA_VEICULO'].str.replace('[^A-Za-z0-9]', '', regex=True)
df2['var'] = df2['DESCR_MARCA_VEICULO'].str.replace('[^A-Za-z0-9]', '', regex=True)

m = pd.merge(df1,df2,on='var').drop('var',axis=1)

print(m)

    col1     col2 DESCR_MARCA_VEICULO_x  ID DESCR_MARCA_VEICULO_y
0  ....     ....           VW/VOYAGE LS   2        VW - VOYAGE LS
1  ....     ....              VW/GOL LS   3           VW - GOL LS

However, there are several keys that still don't match. Let's use Erfan's answer from this post:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    """
    df_1 is the left table to join
    df_2 is the right table to join
    key1 is the key column of the left table
    key2 is the key column of the right table
    threshold is how close the matches should be to return a match, based on Levenshtein distance
    limit is the number of matches that will be returned, sorted from highest to lowest score
    """
    s = df_2[key2].tolist()

    # For each left-table key, find the `limit` best-scoring candidates on the right.
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m

    # Keep only candidates at or above the threshold, joined into one string.
    m2 = df_1['matches'].apply(lambda x: ', '.join(i[0] for i in x if i[1] >= threshold))
    df_1['matches'] = m2

    return df_1

print(fuzzy_merge(df1, df2, 'var', 'var', threshold=80))

    col1     col2      DESCR_MARCA_VEICULO                  var     matches
0  ....     ....   GM/CELTA 5 PORTAS SUPER  GMCELTA5PORTASSUPER    GMCELTA5
1  ....     ....              VW/VOYAGE LS           VWVOYAGELS  VWVOYAGELS
2  ....     ....                 VW/GOL LS              VWGOLLS     VWGOLLS
3  ....     ....        I/AUDI A4 2.0T FSI        IAUDIA420TFSI            
4  ....     ....            FIAT/UNO CS IE          FIATUNOCSIE   
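
To finish the merge the question asks for (bringing ID into df1), the matches column can be joined back against df2's normalized var key; a sketch, assuming fuzzy_merge was run with limit=1 so each row holds at most one match:

# Pull the ID in by joining the fuzzy match back to df2's normalized key.
out = df1.merge(df2[['var', 'ID']], left_on='matches', right_on='var',
                how='left', suffixes=('', '_df2'))
print(out[['DESCR_MARCA_VEICULO', 'ID']])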
Umar.H
  • Thanks for the help! However, df1 has approximately 2 million rows, and the fuzzy_merge function has already been running for 12 hours. Is there anything I can do to cut the runtime? – Danilo Perl Jan 26 '20 at 14:50
  • @DaniloPerl I think not, as you're dealing with strings - maybe there are keywords you can extract. But I'm not an expert; why not take a look at the linked post, or do this task in `Dask` or `pyspark` or another data-processing framework? – Umar.H Jan 26 '20 at 14:51
  • I also left the algorithm from the difflib package, covered in the quoted post, running. I'll wait and see which one finishes first. My question is whether I applied difflib correctly, but I'll check later and post the result here :) – Danilo Perl Jan 26 '20 at 19:20
  • A question: if I set n=1 in the function, does that decrease the processing time a lot? – Danilo Perl Jan 27 '20 at 15:15
  • I'm not sure, as that's a different package and you're working with strings; it's best to ask a new question about how to optimize this solution - I'm sure someone will know. Otherwise the only thing I can think of is throwing more compute at it so it runs faster; that's where things like `Dask` or `Pyspark` come into play. – Umar.H Jan 27 '20 at 15:44
  • I had an idea: for both dfs I will split the model column in two, brand and model. From there I can apply the algorithm to each model in df1 while searching df2 only within the same car brand, you know? – Danilo Perl Jan 27 '20 at 16:43
  • @DaniloPerl Yes, that's what I meant by `Keywords` above: you can split out the brand and then apply the function within each group, which would run much faster (see the sketch below). Give it a go and let me know how it goes - curious to see the performance increase! – Umar.H Jan 27 '20 at 16:44
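
A rough sketch of that per-brand idea (a hypothetical helper, not from the thread; it assumes a BRAND column was already split out on both dataframes, as in the extraction sketch under the question's comments):

from fuzzywuzzy import process

def match_within_brand(df1, df2, key='var', brand_col='BRAND', threshold=80):
    # Group df2's keys by brand so each lookup only scans same-brand candidates.
    pools = df2.groupby(brand_col)[key].apply(list).to_dict()

    def best_match(row):
        pool = pools.get(row[brand_col], [])
        if not pool:
            return None
        match, score = process.extractOne(row[key], pool)
        return match if score >= threshold else None

    df1['matches'] = df1.apply(best_match, axis=1)
    return df1

Since each process.extractOne call now compares against only one brand's models instead of the full 5,000+, the number of pairwise comparisons drops roughly by a factor of the number of brands.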