
I would like to know how to build a very simple tokenizer. Given a dictionary d (in this case a list) and a sentence s, I would like to return all possible tokens (= words) of the sentence. Here is what I tried:

l = ["the","snow","ball","snowball","is","cold"]
sentence = "thesnowballisverycold"

def subs(string, ret=['']):
    # Recursively collect every subsequence of `string`:
    # each character is either appended to the existing
    # prefixes or skipped.
    if len(string) == 0:
        return ret
    head, tail = string[0], string[1:]
    ret = ret + list(map(lambda x: x + head, ret))
    return subs(tail, ret)

print(list(set(subs(sentence)) & set(l)))

But this returns:

["snow","ball","cold","is","snowball","the"]

I could compare substrings, but there must be a better way to do this, right? What I want:

["the","snowball","is","cold"]
Ron

1 Answer


You can use a regular expression here:

import re
l = ["the","snow","ball","snowball","is","cold"]
# Sort by length, longest first, so "snowball" is tried before "snow"/"ball".
pattern = "|".join(sorted(l, key=len, reverse=True))
sentence = "thesnowballisverycold"
print(re.findall(pattern, sentence))
# => ['the', 'snowball', 'is', 'cold']


The pattern will look like snowball|snow|ball|cold|the|is. The trick is to make sure all alternatives are listed from longest to shortest: regex alternation tries the branches left to right and takes the first one that matches, so a shorter word like snow would otherwise shadow snowball (see Order of regular expression operator (..|.. ... ..|..)). The sorted(l, key=len, reverse=True) part sorts the items in l by length in descending order, and "|".join(...) creates the alternation pattern.
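To make the effect of the ordering concrete, here is a quick sketch (the variable name unsorted_pattern is mine, not from the answer): joining the list without sorting lets the shorter alternatives win, so snowball can never match:

import re

l = ["the","snow","ball","snowball","is","cold"]
sentence = "thesnowballisverycold"

# "the|snow|ball|snowball|is|cold": "snow" is tried before "snowball",
# so the longer token never gets a chance.
unsorted_pattern = "|".join(l)
print(re.findall(unsorted_pattern, sentence))
# => ['the', 'snow', 'ball', 'is', 'cold']

One more caveat: this assumes the dictionary words are plain letters. If they could ever contain regex metacharacters (., *, ?, etc.), escaping each word with re.escape before joining keeps the pattern literal, a minimal sketch:

# Escape each word so metacharacters are matched literally.
safe_pattern = "|".join(map(re.escape, sorted(l, key=len, reverse=True)))
print(re.findall(safe_pattern, sentence))
# => ['the', 'snowball', 'is', 'cold']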

Wiktor Stribiżew