String not matching correct string in alternators using findall

Question

I used re.findall to tokenize strings where a token does not always end after a single word (a token can be a compound of several words). The tokens come out as described, except that the dots included in the regex pattern are not kept.

For instance, consider the following code:

import re
all_domain=['com edu','.com edu','inc.', '.com', 'inc', 'com', '.edu', 'edu']
all_domain.sort(key=len, reverse=True)
domain_alternators = '|'.join(all_domain)

print(domain_alternators)
regex = re.compile(r'\b({}|[a-z-A-Z]+)\b'.format(domain_alternators))
print(regex)
#re.compile('\\b(.com edu|com edu|inc.|.com|.edu|inc|com|edu|[a-z-A-Z]+)\\b')

name= 'BASIC SCHOOL DISTRICT .COM'
result=regex.findall(name.lower())

It should return ['basic', 'school', 'district', '.com'], because .com has higher priority in the alternation (.com comes before com in the list of alternatives):

.com edu|com edu|inc.|.com|.edu|inc|com|edu

How can I get ['basic', 'school', 'district', '.com'] instead of ['basic', 'school', 'district', 'com']?

Thanks

When you have a string like `.com` there is no `\b` before the `.`. From the docs: `\b is defined as the boundary between a \w and a \W character` — Mark, Mar 11 '20 at 20:16

score 1 · Accepted Answer · answered Mar 11 '20 at 20:16

You should:

Escape the alternatives so that . matches only a literal dot (that is, use '|'.join(map(re.escape, all_domain))).
Use unambiguous word boundaries, (?<!\w) on the left and (?!\w) on the right, because the meaning of \b is context-dependent: it matches only between a \w and a \W character, so it cannot match between a space and a dot. See "Regular Expression Word Boundary and Special Characters" and "regex to match word boundary beginning with special characters"; there are many more such questions.
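To see why both changes matter, here is a minimal sketch that checks each point in isolation. The sample string and the variable names are made up for illustration and are not part of the question's code.

```python
import re

text = "district .com"  # illustrative input, not from the question

# \b needs a \w on one side and a \W on the other. The space and the dot
# are both \W, so there is no boundary in front of ".com".
with_b = re.search(r"\b\.com\b", text)

# The lookarounds only require that no word character is adjacent.
with_lookarounds = re.search(r"(?<!\w)\.com(?!\w)", text)

# Unescaped, "." matches any character, so "inc." also matches "incx".
unescaped_matches = bool(re.fullmatch("inc.", "incx"))
escaped_matches = bool(re.fullmatch(re.escape("inc."), "incx"))

print(with_b)                    # None
print(with_lookarounds.group())  # .com
print(unescaped_matches)         # True
print(escaped_matches)           # False
```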

Use

import re
all_domain = ['com edu', '.com edu', 'inc.', '.com', 'inc', 'com', '.edu', 'edu']
all_domain.sort(key=len, reverse=True)
domain_alternators = '|'.join(map(re.escape, all_domain))  # <-- HERE
regex = re.compile(r'(?<!\w)({}|[a-z-A-Z]+)(?!\w)'.format(domain_alternators))  # <-- HERE

name = 'BASIC SCHOOL DISTRICT .COM'
result = regex.findall(name.lower())
print(result)  # => ['basic', 'school', 'district', '.com']

See the Python demo
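If the same approach is needed for more than one list of phrases, it could be wrapped in a small function. The sketch below is only one way to do that: build_tokenizer is a hypothetical name, the fallback class is written as [a-zA-Z] (assuming the hyphen in the original [a-z-A-Z] was not intended), and a non-capturing group is used so that findall returns the whole match.

```python
import re

def build_tokenizer(phrases):
    """Hypothetical helper: compile a tokenizer that prefers known phrases."""
    # Longest first, so ".com edu" is tried before ".com" and "com".
    ordered = sorted(phrases, key=len, reverse=True)
    # Escape each phrase so that "." only matches a literal dot.
    alternation = "|".join(map(re.escape, ordered))
    # Fall back to a plain run of letters when no phrase matches.
    return re.compile(r"(?<!\w)(?:{}|[a-zA-Z]+)(?!\w)".format(alternation))

tokenizer = build_tokenizer(
    ["com edu", ".com edu", "inc.", ".com", "inc", "com", ".edu", "edu"]
)
tokens = tokenizer.findall("basic school district .com edu inc.")
print(tokens)  # ['basic', 'school', 'district', '.com edu', 'inc.']
```

Sorting inside the function keeps the longest-first ordering from depending on how the caller arranged the list.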
