I would like to know how to build a very simple tokenizer. Given a dictionary d (in this case a list) and a sentence s, I would like to return all possible tokens (= words) of the sentence. Here is what I tried:
l = ["the","snow","ball","snowball","is","cold"]
sentence = "thesnowballisverycold"
def subs(string, ret=['']):
    # Base case: no characters left, return everything collected so far.
    if len(string) == 0:
        return ret
    head, tail = string[0], string[1:]
    # Keep every string collected so far, and also extend each one
    # with the next character, then recurse on the rest of the input.
    ret = ret + list(map(lambda x: x + head, ret))
    return subs(tail, ret)

print(list(set(subs(sentence)) & set(l)))
But this returns every word in the dictionary, since subs builds all subsequences of the sentence rather than a segmentation:
["snow","ball","cold","is","snowball","the"]
I could compare substrings, but there must be a better way to do this, right? What I want is:
["the","snowball","is","cold"]