Best way to recognize same club names that are written in a different way

Question

    for x in range(len(fclub1)-1):
        for y in range(x+1,len(fclub1)-1):
            if  SequenceMatcher(None,fclub1[x], fclub1[y]).ratio() > 0.4:
                if SequenceMatcher(None,fclub2[x], fclub2[y]).ratio() > 0.4:
                    if float(fbest_odds_1[x]) < float(fbest_odds_1[y]):
                        fbest_odds_1[x] = fbest_odds_1[y]
                    if float(fbest_odds_x[x]) < float(fbest_odds_x[y]):
                        fbest_odds_x[x] = fbest_odds_x[y]
                    if float(fbest_odds_2[x]) < float(fbest_odds_2[y]):
                        fbest_odds_2[x] = fbest_odds_2[y]
                    fclub1.pop(y)
                    fclub2.pop(y)
                    fbest_odds_1.pop(y)
                    fbest_odds_x.pop(y)
                    fbest_odds_2.pop(y)

The code above can't reliably match club names from different bookmakers, for example "Manchester United" versus "Man. Utd."

I tried fixing it with SequenceMatcher so that it would recognize at least part of the club name, but then it started treating different clubs as the same, for example Aston Villa - Atherton Collieries and Leeds - Liversedge.

Welcome to Stack Overflow. Yes, `difflib` is not an appropriate tool for the job, because the rule that you want to use - the one that will give you the desired result, that "Manchester United" and "Man. Utd." mean the same thing, but "Aston Villa" and "Atherton Collieries" do not - can't be expressed this way. You need something that *attempts to understand English* in a more sophisticated way. However, we do not offer third-party library recommendations here - [please try to research](https://meta.stackoverflow.com/questions/261592) "fuzzy string matching". — Karl Knechtel, Jan 13 '23 at 01:30
The best solution is probably the most boring, just make a list of often used names for each team and use that — Caridorc, Jan 13 '23 at 01:30
Either that or, yes, just hard-code the "matching" names, if you can know them all ahead of time. — Karl Knechtel, Jan 13 '23 at 01:31
That is sadly not possible, since I'm scraping the match data from different betting sites and there are around 400 clubs for each bookmaker. Is there a way it could search for a run of consecutive characters that is the same in both names? — Nikola Filipovic, Jan 13 '23 at 01:38

score 0 · Answer 1 · answered Jan 13 '23 at 01:33

The best solution is probably the most boring one: make a list of the commonly used names for each team and map each of them to a single canonical name, such as:

def standardize_team_name(name):
    if name in ["Manchester", "Manchester United", "Man. Utd."]:
        return "Manchester United"
    elif name in ["Aston Villa", ...]:
        return "Aston Villa"
    elif ...
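
The same idea can be written as a lookup table instead of a chain of `if`/`elif` branches, which is easier to maintain with a few hundred clubs. This is only a minimal sketch: the `ALIASES` table, the `_key` helper and the "A. Villa" spelling are hypothetical and would have to be filled in from the names the scraped sites actually use.

```python
# Hypothetical alias table: canonical name -> spellings seen on different sites.
# Only the Manchester United entries come from the answer above; "A. Villa"
# is an invented spelling used purely for illustration.
ALIASES = {
    "Manchester United": ["Manchester", "Manchester United", "Man. Utd."],
    "Aston Villa": ["Aston Villa", "A. Villa"],
}


def _key(name):
    """Lowercase, drop dots and collapse whitespace so trivial variants agree."""
    return " ".join(name.lower().replace(".", " ").split())


# Invert the table once: normalized spelling -> canonical name.
LOOKUP = {
    _key(alias): canonical
    for canonical, aliases in ALIASES.items()
    for alias in aliases
}


def standardize_team_name(name):
    """Return the canonical club name, or None if the spelling is unknown."""
    return LOOKUP.get(_key(name))
```

Returning `None` for unknown spellings makes it possible to log them and extend the table by hand, rather than guessing and merging two different clubs.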

score 0 · Answer 2 · answered Jan 13 '23 at 01:53

Something that would address the particular case you mention would be

def is_abbr(shortened, full):
   short_words = shortened.replace('.', '').split(' ')
   full_words = full.split(' ')
   match = zip(short_words, full_words)
   return all(is_subseq(short, full) for short, full in match)

One coding of is_subseq is the following, taken from here:

def is_subseq(x, y):
   it = iter(y)
   return all(any(c == ch for c in it) for ch in x)
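
To see how these two functions behave on the names from the question, here is a small self-contained sketch. It repeats both functions and adds `is_abbr_strict`, a hypothetical variant that is not part of the answer: it ignores case, requires the same number of words and the same first letter per word. One thing worth noting is that `zip` stops at the shorter list, so the original `is_abbr` accepts "Manchester" as an abbreviation of "Manchester City".

```python
def is_subseq(x, y):
    """True if the characters of x appear in y in the same order."""
    it = iter(y)
    return all(any(c == ch for c in it) for ch in x)


def is_abbr(shortened, full):
    """Word-by-word subsequence check, as in the answer above."""
    short_words = shortened.replace('.', '').split(' ')
    full_words = full.split(' ')
    return all(is_subseq(s, f) for s, f in zip(short_words, full_words))


def is_abbr_strict(shortened, full):
    """Hypothetical stricter variant: case-insensitive, same word count,
    and each short word must start with the same letter as the full word."""
    short_words = shortened.replace('.', '').lower().split()
    full_words = full.lower().split()
    if len(short_words) != len(full_words):
        return False
    return all(
        s[0] == f[0] and is_subseq(s, f)
        for s, f in zip(short_words, full_words)
    )


print(is_abbr("Man. Utd.", "Manchester United"))        # True
print(is_abbr("Aston Villa", "Atherton Collieries"))    # False
print(is_abbr("Leeds", "Liversedge"))                   # False
print(is_abbr("Manchester", "Manchester City"))         # True (zip truncates)
print(is_abbr_strict("Manchester", "Manchester City"))  # False
```

Whether the strict variant is preferable depends on the data: it rejects "Leeds" against "Leeds United", which may or may not be what you want.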

You could also look for better string comparison modules. Or build up a lookup table for abbreviations. If your program has access to the internet, you could also just do a Google search for each name and see what comes up, and write some code to process the result to figure out what football club it is.
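
The lookup-table idea for abbreviations could look roughly like the sketch below. It works on single words rather than whole club names, so one entry such as "utd" covers every club that uses it. The `ABBREVIATIONS` table and the function names are assumptions made for illustration; in particular, mapping "man" to "manchester" is a guess that would need checking against the real data.

```python
import re

# Hypothetical word-level abbreviations; an empty string drops the word.
ABBREVIATIONS = {
    "man": "manchester",
    "utd": "united",
    "fc": "",
}


def normalize(name):
    """Lowercase, strip punctuation and expand known abbreviations."""
    tokens = re.findall(r"\w+", name.lower())
    expanded = (ABBREVIATIONS.get(token, token) for token in tokens)
    return " ".join(token for token in expanded if token)


def same_club(a, b):
    """Two names refer to the same club if they normalize identically."""
    return normalize(a) == normalize(b)


print(normalize("Man. Utd."))                            # manchester united
print(same_club("Man. Utd.", "Manchester United"))       # True
print(same_club("Aston Villa", "Atherton Collieries"))   # False
```

Because the comparison is exact after normalization, this approach never merges two clubs by accident, but it also misses any abbreviation that is not in the table.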

score 0 · Answer 3 · answered Jan 13 '23 at 02:30

I ended up using the fuzzywuzzy library and its `fuzz.partial_ratio()` function.
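
The answer does not include its code. As a rough illustration of what a partial match means, the sketch below scores the shorter name against every equally long slice of the longer one, using only `difflib` from the standard library. It is a stand-in written for this example, not fuzzywuzzy's implementation, and its scores may differ from what `fuzz.partial_ratio()` returns.

```python
from difflib import SequenceMatcher


def partial_ratio(a, b):
    """Best similarity (0-100) between the shorter string and any slice of
    the longer string that has the same length. Illustrative stand-in only."""
    short, long_ = sorted((a.lower(), b.lower()), key=len)
    if not short:
        return 0
    best = 0.0
    for start in range(len(long_) - len(short) + 1):
        window = long_[start:start + len(short)]
        best = max(best, SequenceMatcher(None, short, window).ratio())
    return round(100 * best)


print(partial_ratio("Leeds", "Leeds United"))          # 100
print(partial_ratio("Manchester", "Manchester City"))  # 100
print(partial_ratio("Leeds", "Liversedge") < 100)      # True
```

Substring-style matching gives a perfect score whenever one name is contained in the other, so a bare "Manchester" matches both Manchester clubs equally well. A threshold alone cannot resolve that case.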

Answered by Nikola Filipovic
