Regex: Matching a word with period using \. not working

Question

I am having trouble matching a string followed by a period, such as inc., ltd. or corp. AFAIK, to match the . I should write it as \., as in the following example:

\b(inc\.|ltd\.|corp\.|corp)\b(?=(?:.*\s+\w+$))

However, for names such as ABC LTD. BLOCK, SMALL LTD. ASSOCIATION and BASIC LTD. REGULAR NAME, this does not find ltd. with the period. If I change it to \b(inc|ltd|corp)\b, it finds ltd, but without the period.

How can I include the . when searching in a string?

import re

rgx_list = 'inc\.|ltd\.|corp\.'
regex = r'\b({})\b(?=(?:.*\s+\w+$))'.format(rgx_list)
st = 'ABC LTD. BLOCK'

found = re.findall(regex, st.lower())

Thanks for your guidance.

The problem isn't in `rgx_list`. The problem is with the lookahead in `regex`. — Barmar, Apr 06 '20 at 19:35
You do not need to give up checking whether your match is enclosed by word chars. You just need `(?<!\w)(inc\.|ltd\.|corp\.|corp)(?!\w)(?=.*\s\w+$)` - and you won't match `corporation` any longer. — Wiktor Stribiżew, Apr 06 '20 at 22:03
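
A minimal sketch of the lookaround pattern from the comment above, run on names from the question. The `re.IGNORECASE` flag and the `BASIC CORPORATION NAME` sample are additions for illustration, not part of the original comment:

```python
import re

# Pattern copied from the comment; re.IGNORECASE is an assumption added here
# so the input does not have to be lowercased first.
pattern = re.compile(
    r'(?<!\w)(inc\.|ltd\.|corp\.|corp)(?!\w)(?=.*\s\w+$)',
    re.IGNORECASE,
)

samples = ['ABC LTD. BLOCK', 'SMALL LTD. ASSOCIATION', 'BASIC CORPORATION NAME']
results = {s: pattern.findall(s) for s in samples}

for s, found in results.items():
    print(s, '->', found)
```

The `(?!\w)` after the alternation is what stops `corp` from matching inside `CORPORATION`.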

score 1 · Accepted Answer · answered Apr 06 '20 at 19:41

The problem isn't with escaping the .. The problem is with your use of \b around it.

\b matches a word boundary: a word character on the left and a non-word character on the right, or vice versa.

But you want to match between ltd. and the space after it. That's not a word boundary, because . and space are both non-word characters.
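
The boundary behavior described here can be checked with a small sketch. The sample strings are just illustrations:

```python
import re

text = 'abc ltd. block'

# \b after the period fails: '.' and ' ' are both non-word characters.
after_period = re.search(r'ltd\.\b', text)

# \b before 'ltd' succeeds: ' ' is a non-word character, 'l' is a word character.
before_word = re.search(r'\bltd\.', text)

# With a word character right after the period, the trailing \b matches again.
period_then_word = re.search(r'ltd\.\b', 'abc ltd.x')

print(after_period, before_word, period_then_word)
```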

If you get rid of the trailing \b in regex it will work, although you might get other undesired matches. This is not easy to solve with regular expressions, since their concept of a "word" is not as general as in natural language processing.

regex = r'\b({})(?=(?:.*\s+\w+$))'.format(rgx_list)
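
As a rough check, here is that regex run against the names from the question. `rgx_list` is written as a raw string here, and the `abc.ltd.com` input is a made-up example of the kind of undesired match mentioned above:

```python
import re

rgx_list = r'inc\.|ltd\.|corp\.'
# Leading \b kept, trailing \b dropped.
regex = r'\b({})(?=(?:.*\s+\w+$))'.format(rgx_list)

names = ['ABC LTD. BLOCK', 'SMALL LTD. ASSOCIATION', 'BASIC LTD. REGULAR NAME']
found = [re.findall(regex, name.lower()) for name in names]
print(found)

# Hypothetical input: without a check after the period, 'ltd.' inside
# a dotted name is matched as well.
unwanted = re.findall(regex, 'visit abc.ltd.com today')
print(unwanted)
```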

Barmar

Which regex approach would you recommend for these scenarios? I already tried without the word boundary, but as you said, I am getting undesired matches. – John Barton Apr 06 '20 at 19:48
Regular expressions aren't the solution for everything. A natural language toolkit library might be more appropriate. – Barmar Apr 06 '20 at 20:08

score 0 · Answer 2 · answered Apr 06 '20 at 19:37

In a regular (non-raw) Python string, \. only keeps its backslash because \. is not a recognized escape sequence, and recent Python versions warn about such invalid escapes, so it's better not to rely on it. You can either make rgx_list a raw string, or escape the backslashes: rgx_list= 'inc\\.|ltd\\.|corp\\.'
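
As a quick check, the raw-string form and the doubled-backslash form suggested here produce the same pattern text. The sample input is taken from the question:

```python
import re

raw_form = r'inc\.|ltd\.|corp\.'
escaped_form = 'inc\\.|ltd\\.|corp\\.'

# Both spellings yield the same characters: a backslash followed by a period.
same = raw_form == escaped_form
print(same)

# Either one matches a literal period after 'ltd'.
match = re.search(raw_form, 'abc ltd. block')
print(match.group())
```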

Robin Zigmond

marketzero · Answer 3 · 2020-04-06T20:20:53.170

Regular expressions are awesome. However, every language implements them differently, and when the syntax is so strict, the slightest difference can get you into trouble.

I highly recommend regex101; it handles most of these issues and it's my go-to source.

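Having said that, note that Python's re module is not in "multiline" mode by default: ^ and $ only match at the start and end of every line when re.MULTILINE is passed. The example below still works line by line because . does not match a newline by default, so .* never runs past the end of a line.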
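
How anchors and . behave on multi-line input depends on the flags passed. A small check, using the same sample string as this answer (the variable names are only illustrative):

```python
import re

word_list = "ABC LTD. BLOCK\nSMALL LTD. ASSOCIATION\nBASIC LTD. REGULAR NAME"

# No flags: ^ matches only at the very start of the string.
default_hits = re.findall(r'^\w+', word_list)

# re.MULTILINE: ^ also matches right after each newline.
multiline_hits = re.findall(r'^\w+', word_list, flags=re.MULTILINE)

# '.' does not match '\n' unless re.DOTALL is set, so '.+' stops at each line end.
line_hits = re.findall(r'.+', word_list)

print(default_hits, multiline_hits, line_hits)
```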

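import re

word_list = "ABC LTD. BLOCK\nSMALL LTD. ASSOCIATION\nBASIC LTD. REGULAR NAME"

# (?:ltd|LTD) is an alternation; [ltd|LTD] would be a character class
# that matches only a single one of those characters.
pattern = r".*\b(?:ltd|LTD)\.(?=\s+\w+)"

for found in re.findall(pattern, word_list):
    print(found)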

Output

ABC LTD.
SMALL LTD.
BASIC LTD.

Note

In your lookahead you specify .* (any characters), followed by \s+ (one or more whitespace characters) and \w+ (one or more word characters).

A regex engine does comparisons based on a pattern. The simpler the pattern, the better: searching is faster and uses fewer CPU cycles.

Instead of (?=.*\s+\w+$), why not (?=\s+\w+)?
For example:

    r".*\b(?:ltd|LTD)\.(?=\s+\w+)"

This will not include the word after ltd. in the match, which is what you intend, yes?
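
The two lookaheads are not strictly equivalent, so here is a small sketch comparing them. The second input, with a trailing `!`, is a made-up example:

```python
import re

short_lookahead = r'ltd\.(?=\s+\w+)'     # only requires a word right after 'ltd.'
long_lookahead = r'ltd\.(?=.*\s+\w+$)'   # requires the string to end in a word

comparison = {}
for text in ['basic ltd. regular name', 'abc ltd. block!']:
    comparison[text] = (
        bool(re.search(short_lookahead, text)),
        bool(re.search(long_lookahead, text)),
    )
    print(text, comparison[text])
```

The shorter lookahead does less work, but it no longer checks what the end of the string looks like, so which one fits depends on the data.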
