Word boundary regex does not match the whole word for Devanagari script

Question

import re

articles = ['a', 'an', 'the']
regex = r"\b(?:{})\b".format("|".join(articles))
sent = 'Davis is theta'
re.split(regex, sent)
>> ['Davis is theta']

This snippet works for English ("theta" is not split at "the"), but with Devanagari script it also matches parts of words.

stopwords = ['कम','र','छ']
regex = r"\b(?:{})\b".format("|".join(stopwords))
sent = "रामको कम्पनी छ"
re.split(regex,sent)
>> ['', 'ामको ', '्पनी छ']

Expected output:

['रामको', 'कम्पनी']

I am using Python 3. Is this a bug, or am I missing something?

I suspect \b only matches around [a-zA-Z0-9], and I am using Unicode text. Is there an alternative approach for this task?

@anubhava didn't work. ['रामको', '्पनी छ'] is not the desired output — Ashutosh Chapagain, Jun 10 '19 at 06:40

anubhava · Accepted Answer · 2019-06-10T07:02:04.743

You may want to use findall instead of split:

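import re

stopwords = ['कम','र','छ']
# Match a run of non-space characters, unless a stopword
# followed by whitespace or end of string starts at that position.
reg = re.compile(r'(?!(?:{})(?!\S))\S+'.format("|".join(stopwords)))

sent = 'रामको कम्पनी छ'
print(reg.findall(sent))  # ['रामको', 'कम्पनी']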

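This regex avoids the word boundary \b, which doesn't work well with Unicode text such as Devanagari. In Python 3, \b is defined in terms of \w, and \w does not match Devanagari combining marks such as the vowel sign ा or the virama ्, so \b reports a boundary in the middle of a word like रामको or कम्पनी.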
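To see where \b goes wrong, it can help to list which code points of a word \w actually matches. The following is only a small diagnostic sketch; the helper name describe is made up for illustration and is not part of the re module.

```python
import re
import unicodedata

WORD_CHAR = re.compile(r"\w")

def describe(text):
    """Return (char, Unicode category, matched by the word-character class) per code point."""
    return [
        (ch, unicodedata.category(ch), bool(WORD_CHAR.fullmatch(ch)))
        for ch in text
    ]

word = "कम्पनी"
for ch, category, is_word in describe(word):
    print("U+{:04X} {} word_char={}".format(ord(ch), category, is_word))

# Code points that the word-character class rejects (combining marks here).
non_word = [ch for ch, _, is_word in describe(word) if not is_word]

# Because the virama after "कम" is not a word character, \b sees a boundary there.
boundary_hit = re.search(r"\bकम\b", word)
```

The letters come out as category Lo and match \w, while the virama (Mn) and the vowel sign (Mc) do not, which is why the stopword कम is found inside कम्पनी.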

See also: "Python unicode regular expression matching failing with some unicode characters - bug or mistake?"
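If the tokens are always separated by whitespace, another option is to anchor the stopwords with whitespace lookarounds on both sides in place of \b. This is a sketch under that assumption; the function name remove_stopwords is hypothetical. Anchoring the left side as well may matter: the findall pattern above has no left-hand check, so a multi-character stopword standing alone (for example the input 'कम छ') appears to leave its tail 'म' behind.

```python
import re

stopwords = ['कम', 'र', 'छ']

# (?<!\S) means "no non-space character just before",
# (?!\S) means "no non-space character just after".
# Together they require the stopword to be a whole whitespace-delimited token.
alternation = "|".join(map(re.escape, stopwords))
pattern = re.compile(r"(?<!\S)(?:{})(?!\S)".format(alternation))

def remove_stopwords(sentence):
    """Drop whole-token stopwords and return the remaining tokens."""
    return pattern.sub(" ", sentence).split()

print(remove_stopwords('रामको कम्पनी छ'))
```

For plain whitespace-separated text, a list comprehension over sentence.split() that skips members of set(stopwords) would do the same job without any regex.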
