0

I have two df's

df1

    date            League                                 teams
0   201902272215    brazil cup                             foz do iguacu fcceara ce
1   201902272300    colombia primera a                     deportes tolimaatletico bucaramanga
2   201902272300    brazil campeonato gaucho 2nd division  ypiranga rsuniao frederiquense
3   201902272300    brazil campeonato gaucho 2nd division  esportivo rstupi rs
4   201902272300    brazil campeonato gaucho 2nd division  sao paulo rsgremio esportivo bage
14  201902280000    four nations women tournament (in usa) usa (w)japan (w)
25  201902280030    bolivia professional football league   real potosibolivar

df2

    date            league                     teams
0   201902280000    womens international       usa womenjapan women
1   201902280000    brazil amazonense          sul america ecrio negro am
2   201902280030    bolivia apertura           real potosibolivar
3   201902280030    brazil campeonato paulista palmeirasituano
4   201902280030    copa sudamericana          racing clubcorinthians

The result I would want is all the rows from df2 that are near matches with df1

    date            league                     teams                      near_match
0   201902280000    womens international       usa womenjapan women       1
1   201902280000    brazil amazonense          sul america ecrio negro am 0
2   201902280030    bolivia apertura           real potosibolivar         1
3   201902280030    brazil campeonato paulista palmeirasituano            0
4   201902280030    copa sudamericana          racing clubcorinthians     0 

I have tried to use a variation of a for loop using SequenceMatcher and setting the threshold to a match of above 0.8, but haven't had any luck.

df_1['merge_teams'] = df_1['teams'] # we will use these as the merge keys
df_1['merge_date'] = df_1['date']
# df_1['merge_league'] = df_1['league']


for teams_1, date_1, league_1 in df_1[['teams','date']].values:
    for ixb, (teams_1, teams_2) in enumerate(df_2[['teams','date']].values):
        if difflib.SequenceMatcher(None,teams_1,teams_2).ratio() > .8:
            df_2.ix[ixb,'merge_teams'] = teams_1 # creates a merge key in df_2
        if difflib.SequenceMatcher(None,date_1, date_2).ratio() > .8:
            df_2.ix[ixb,'merge_date'] = date_1 # creates a merge key in df_2

# This should rturn all rows where teams,date and league all match by over 80%
# This is just for teams and date, I want to include league as well

Any advice or guidance would be greatly appreciated.

Joe
  • 59
  • 6

0 Answers0