import re

articles = ['a','an','the']
regex = r"\b(?:{})\b".format("|".join(articles))
sent = 'Davis is theta'
re.split(regex,sent)
>> ['Davis is theta']
This snippet works for English: "theta" is left intact because "the" is only matched as a whole word. With Devanagari script, however, it also matches parts of words.
stopwords = ['कम','र','छ']
regex = r"\b(?:{})\b".format("|".join(stopwords))
sent = "रामको कम्पनी छ"
re.split(regex,sent)
>> ['', 'ामको ', '्पनी ', '']
Expected output (only the standalone stopword removed, the other words left intact):
['रामको', 'कम्पनी']
I am using Python 3. Is this a bug, or am I missing something?
I suspect \b only treats [a-zA-Z0-9_] as word characters, while my text is Unicode. Is there an alternative way to do this?
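To make the suspicion concrete, here is a minimal sketch of a check and one possible workaround. It assumes the cause is that Devanagari vowel signs and the virama are combining marks (Unicode categories Mc and Mn), which \w does not match, so \b sees a boundary in the middle of a word. The names WORD_CHAR, build_stopword_regex and remove_stopwords are made up for illustration, and treating the whole Devanagari block U+0900 to U+097F as word characters is an assumption (it also covers the danda punctuation marks).

```python
import re
import unicodedata

# Check how the regex engine classifies a consonant, a vowel sign and the virama.
for ch in ["र", "ा", "्"]:
    is_word_char = re.fullmatch(r"\w", ch) is not None
    print(repr(ch), unicodedata.category(ch), is_word_char)

# Assumption: any \w character or any code point in the Devanagari block
# counts as part of a word, combining marks included.
WORD_CHAR = r"[\w\u0900-\u097F]"

def build_stopword_regex(stopwords):
    # Longer stopwords first so the alternation prefers the longest candidate.
    alternatives = "|".join(
        re.escape(w) for w in sorted(stopwords, key=len, reverse=True)
    )
    # Lookarounds replace \b: no word character directly before or after.
    return re.compile(
        r"(?<!{wc})(?:{alts})(?!{wc})".format(wc=WORD_CHAR, alts=alternatives)
    )

def remove_stopwords(sentence, stopwords):
    # Delete standalone stopwords, then split the rest on whitespace.
    pattern = build_stopword_regex(stopwords)
    return pattern.sub("", sentence).split()

print(remove_stopwords("रामको कम्पनी छ", ["कम", "र", "छ"]))
# → ['रामको', 'कम्पनी']
```

If the text is already whitespace-separated, filtering sentence.split() against a set of stopwords would probably be simpler than any regex.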