I am trying to build an odds matcher in Python that compares game names using pandas. The problem is that if two names are not a 100% match, the game is not recognised as the same.

Is there an efficient way to match game names, e.g. a percentage match or a fuzzy lookup? I cannot think of a reliable way to do this that minimises errors. Any ideas on how this might be achieved in Python?

          a              b           c           d    e
EC Bahia v Salvador U20 2.3 EC Bahia v Salvador 2.3 NaN     
EC Bahia v Salvador     2.3 EC Bahia v Salvador 2.3 Match   You could get the first word before v and after but….   
Bahai Samone v Salvator 2.3 EC Bahia v Salvador 2.3 Match   However this causes problem when the string was Ec FAHI (different) 

df1

                      EW                          WE    DA  \
0                      k                           k     2   
1  EC Bahia Salvador U20  Clube Atletico Mineiro U20   2.3   
2             Moreirense                     Rio Ave  1.62   
3               EC Bahia                Salvadoa U20    14   
4               EC Bahia                    Salvador  4141   

    DD  
0  https://www.b1 
1  https://www.b1 
2  https://www.b1 
3  https://www.b1 
4  https://www.b1 

df2

                AA            AB                AC    AD  \
0    Starting soon             k                 k  3.15   
1          In-Play  FC Nitra U19  Z Michalovce U19  9.60   
2          In-Play   Sevilla U19    NK Maribor U19   NaN   
3          In-Play    Moreirense           Rio Av   1.02   
4  Starting in 13'      EC Bahia          Salvador  1.07   

      AE  
0  https://www.be
1  https://www.be
2  https://www.be
3  https://www.be
4  https://www.be

Desired:

              AA   AB         AC             AD  \
0  Starting soon    k          k            3.15   
1  Starting in 13'  EC        Bahia         9.60   
1        In-Play  Moreirense  Rio Av        1.02   

              AE         EW        WE  \
0  https://www.b1     k v k         2   
1  https://www.b2   EC v Bahia      4141   
3  https://www.b3 Moreirense v Rio Av 1.02   

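Formula:

import pandas as pd

# Build the "home v away" key from both team columns
df1['EW'] = df1['EW'] + ' v ' + df1['WE']
df1['WE'] = df1['DA']
df1['DA'] = df1['DD']

df2['EW'] = df2['AB'] + ' v ' + df2['AC']

# Exact-match join: rows whose names differ at all are dropped
df3 = pd.merge(df2, df1, on='EW')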

1 Answer

You're basically asking "what are all the ways to do string comparisons in Python", which is a huge question.

Some really basic stuff would be to normalise the strings before comparing them:

  • make everything lowercase
  • remove punctuation
  • maybe remove spaces
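
As a rough sketch of those three steps using only the standard library (the function name `normalize` and the `drop_spaces` flag are made up for illustration, not part of any package):

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def normalize(name, drop_spaces=False):
    """Lowercase a name, strip punctuation, and tidy up whitespace.

    Hypothetical helper: `drop_spaces` removes spaces entirely,
    otherwise runs of whitespace are collapsed to a single space.
    """
    name = name.lower().translate(_PUNCT_TABLE)
    words = name.split()
    return ''.join(words) if drop_spaces else ' '.join(words)

print(normalize('EC Bahia v. Salvador  U20'))
print(normalize('EC Bahia v. Salvador  U20', drop_spaces=True))
```

With pandas, something like `df1['EW'].map(normalize)` on both frames before the merge should make an exact join a bit more forgiving, though it will not catch real misspellings such as "Salvadoa".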

Some more advanced stuff would be:

  • fuzzy matching (I like the fuzzywuzzy package)
  • vectorization + cosine similarity
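
To give a feel for both ideas without installing anything, here is a minimal sketch that swaps in the standard library: `difflib.SequenceMatcher` stands in for fuzzywuzzy, and character trigram counts stand in for a proper vectorizer. The helper names (`fuzzy_ratio`, `cosine_similarity`, `best_match`) and the 0.8 threshold are assumptions for illustration and would need tuning on real data.

```python
import difflib
import math
from collections import Counter

def fuzzy_ratio(a, b):
    """Similarity in [0, 1] based on matching character blocks."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def ngram_counts(text, n=3):
    """Count overlapping character n-grams (a very simple vectorization)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a, b, n=3):
    """Cosine of the angle between the two n-gram count vectors."""
    va, vb = ngram_counts(a, n), ngram_counts(b, n)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # one string is shorter than n characters
    return dot / (norm_a * norm_b)

def best_match(name, candidates, threshold=0.8):
    """Return the closest candidate, or None if nothing clears the threshold."""
    if not candidates:
        return None
    score, match = max((fuzzy_ratio(name, c), c) for c in candidates)
    return match if score >= threshold else None

games = ['Moreirense v Rio Ave', 'EC Bahia v Salvador',
         'FC Nitra U19 v Z Michalovce U19']
print(best_match('Moreirense v Rio Av', games))
print(cosine_similarity('EC Bahia v Salvador', 'EC Bahia v Salvadoa'))
```

One way to use this would be to map each name in one frame to its `best_match` in the other and then merge on the mapped column. A threshold that is too low will pair different fixtures (for example a senior side with its U20 team), so it is worth checking the borderline scores by hand.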

Beyond that, you really have to deep-dive into each of these, IMHO.

Ido S