
I have a long list of car ad titles and another list of all car makes and models. I am searching the titles to find a match in the makes/models list. I have this so far:

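    def find_make(title, carmakes):
        # Return the first make/model that occurs verbatim in the title.
        for make in carmakes:
            if make in title:
                return make
        return None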

but it doesn't work too well, as the titles are written by humans and come with a lot of variations. For example, if the title is 'Nissan D-Max' and I have 'dmax' in my makes/models list, the loop doesn't catch it because the strings don't match exactly. What's the best way to 'loosely' or 'dynamically' check for matches?

moo5e
  • Check if this is something you can work with: https://github.com/seatgeek/fuzzywuzzy – Ofer Sadan Nov 23 '19 at 09:06
  • Handling arbitrary user input and using it to search is a huge topic. The only answer anyone can give you here is that it depends on your data and just how far you want to go with this. Regex is most likely not a good answer to this problem if you have arbitrary user input. Regex literally means "regular expressions" and there's nothing regular about your input. – Peter Nov 23 '19 at 09:08
  • You could research how search engines normalise search requests (for example, do they remove punctuation, or replace it in specific ways?) because this example of adapting human input might help you. – DisappointedByUnaccountableMod Nov 23 '19 at 10:05
  • You could use the distance between 2 strings instead of re. See for instance https://stackoverflow.com/questions/17388213/find-the-similarity-metric-between-two-strings. Then choose a threshold to decide whether a substring is close enough to the title. – Demi-Lune Nov 23 '19 at 10:06
  • The probability of one string matching with another seems like the most promising so I'll definitely give that a look! Thank you – moo5e Nov 23 '19 at 12:53
  • You could remove everything which isn't a space, a letter or a number first, and make it lowercase. `D-Max` would become `dmax` and it would be easier to find it in a string. – Eric Duminil Nov 23 '19 at 13:33
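
Two of the suggestions in the comments (normalising the text first, then falling back to a string-similarity threshold) can be combined in a few lines of standard-library Python. This is only a rough sketch, not a tested solution for real ad data: `normalize`, `find_make_loosely`, and the `0.8` threshold are hypothetical names and values chosen for illustration, and `difflib.SequenceMatcher` stands in for whichever similarity metric you end up preferring.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and drop everything except letters, digits and spaces."""
    return re.sub(r'[^a-z0-9 ]', '', text.lower())


def find_make_loosely(title, carmakes, threshold=0.8):
    """Return the first make/model that loosely matches the title, or None."""
    clean_title = normalize(title)
    words = clean_title.split()
    for make in carmakes:
        clean_make = normalize(make)
        # Exact substring hit after normalisation, e.g. 'D-Max' -> 'dmax'.
        if clean_make in clean_title:
            return make
        # Otherwise accept a title word that is similar enough to the make.
        for word in words:
            if SequenceMatcher(None, clean_make, word).ratio() >= threshold:
                return make
    return None


print(find_make_loosely('Nissan D-Max', ['tiguan', 'dmax']))
print(find_make_loosely('Nisan Qashqai', ['nissan']))
```

The word-by-word comparison assumes single-word makes. Multi-word names, or very short ones that happen to occur inside other words, would need extra handling, and the threshold has to be tuned against real titles.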

1 Answer


I once came across a similar challenge. Below is a simplified solution:

import re

def re_compile(*args, flags: int = re.IGNORECASE, **kwargs):
    # Compile a pattern that ignores case unless other flags are passed.
    return re.compile(*args, flags=flags, **kwargs)

class Term(object):
    """A term (e.g. a car model) recognized by matching rules and forbidding rules."""
    def __init__(self, contain_patterns, *contain_args):
        self.matching_rules = []
        self.forbid_rules = []
        if isinstance(contain_patterns, str):
            self.may_contain(contain_patterns, *contain_args)
        else:
            for cp in contain_patterns:
                self.may_contain(cp, *contain_args)

    def __eq__(self, other):
        return isinstance(other, str) and self.is_alias(other)

    def is_alias(self, s: str):
        return (
            all(not f_rule(s) for f_rule in self.forbid_rules) and
            any(m_rule(s) for m_rule in self.matching_rules)
        )

    def matching_rule(self, f):
        self.matching_rules.append(f)
        return f

    def forbid_rule(self, f):
        self.forbid_rules.append(f)
        return f

    def must_rule(self, f):
        self.forbid_rules.append(lambda s: not f(s))
        return f

    def may_be(self, *re_fullmatch_args):
        self.matching_rules.append(re_compile(*re_fullmatch_args).fullmatch)

    def must_be(self, *re_fullmatch_args):
        fmatch = re_compile(*re_fullmatch_args).fullmatch
        self.forbid_rules.append(lambda s: not fmatch(s))

    def must_not_be(self, *re_fullmatch_args):
        self.forbid_rules.append(re_compile(*re_fullmatch_args).fullmatch)

    def may_contain(self, *re_search_args):
        self.matching_rules.append(re_compile(*re_search_args).search)

    def must_not_contain(self, *re_search_args):
        self.forbid_rules.append(re_compile(*re_search_args).search)

    def may_starts_with(self, *re_match_args):
        self.matching_rules.append(re_compile(*re_match_args).match)

    def must_not_starts_with(self, *re_match_args):
        self.forbid_rules.append(re_compile(*re_match_args).match)

In your case, each car model should be represented as a Term instance with its own regex rules (I do not know much about car brands, so I invented some names):

if __name__ == '__main__':
    # '-' goes last in the character class so it is a literal hyphen, not a range.
    dmax = Term((r'd[ ._\'"-]?max', r'Nissan DM'))
    dmax.may_contain(r'nissan\s+last\s+(year)?\s*model')
    dmax.must_not_contain(r'Skoda')
    dmax.must_not_contain(r'Volkswagen')

    @dmax.matching_rule
    def dmax_check(s):
        return re.search(r'double\s+max', s, re.IGNORECASE) and re.search(r'nissan', s, re.IGNORECASE)

    tg = Term(r'Tiguan')
    octav = Term(r'Octavia')

    titles = (
        'Dmax model',
        'd_Max nissan',
        'Nissan Double Max Pro',
        'nissan last model',
        'Skoda octavia',
        'skoda d-max',
        'Nissan Qashqai',
        'VW Polo double max'
    )

Your example:

for car_model in (dmax, tg, octav):
    print(car_model in titles)

Result:

True
False
True
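
The `in` test above works because a tuple membership check compares the candidate with each element using `==`. Since `str` does not know how to compare itself with a `Term`, Python falls back to the `__eq__` defined on the class. The sketch below isolates that mechanism with a hypothetical stand-in class, `ContainsWord`, which is not part of the code above.

```python
import re


class ContainsWord:
    """Hypothetical stand-in for Term: equal to any string that contains `word`."""

    def __init__(self, word):
        self._search = re.compile(re.escape(word), re.IGNORECASE).search

    def __eq__(self, other):
        # Strings compare equal to this object when the word occurs in them.
        return isinstance(other, str) and self._search(other) is not None


titles = ('Skoda octavia', 'Nissan Qashqai')

# `in` tries element == candidate for each element and ends up in our __eq__.
octavia_found = ContainsWord('Octavia') in titles
tiguan_found = ContainsWord('Tiguan') in titles
print(octavia_found, tiguan_found)
```

Note that defining `__eq__` without `__hash__` makes instances unhashable, so objects like these cannot be used as dictionary keys or set members.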

Details:

print(' '*26, 'DMAX TIGUAN OCTAVIA')
for title in titles:
    print(title.ljust(26), (dmax == title), (tg == title), (octav == title))

Result:

                           DMAX TIGUAN OCTAVIA
Dmax model                 True False False
d_Max nissan               True False False
Nissan Double Max Pro      True False False
nissan last model          True False False
Skoda octavia              False False True
skoda d-max                False False False
Nissan Qashqai             False False False
VW Polo double max         False False False
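
If the goal is to get the matching model back for each title, as in the question's loop, the same allow/forbid idea can be sketched without the class. This is a minimal illustration only: `RULES`, `COMPILED`, and `find_model` are hypothetical names, and the patterns are invented in the same way as the ones above.

```python
import re

# Hypothetical rule table: model name -> (patterns that may match, patterns that forbid).
RULES = {
    'dmax': ([r'd[ ._\'"-]?max'], [r'skoda', r'volkswagen']),
    'tiguan': ([r'tiguan'], []),
    'octavia': ([r'octavia'], []),
}

# Compile every pattern once, ignoring case.
COMPILED = {
    name: (
        [re.compile(p, re.IGNORECASE) for p in allow],
        [re.compile(p, re.IGNORECASE) for p in forbid],
    )
    for name, (allow, forbid) in RULES.items()
}


def find_model(title):
    """Return the first model whose rules accept the title, or None."""
    for name, (allow, forbid) in COMPILED.items():
        # A single forbidding pattern rules the model out.
        if any(p.search(title) for p in forbid):
            continue
        if any(p.search(title) for p in allow):
            return name
    return None


for title in ('d_Max nissan', 'skoda d-max', 'Skoda octavia', 'Nissan Qashqai'):
    print(title.ljust(16), find_model(title))
```

The rules are checked in dictionary order and the first accepted model wins, so more specific models should come before more general ones.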
facehugger