-1

I am having trouble matching a string followed by a period, such as inc., ltd., or corp. AFAIK, to match the . I should write it as \. as in the following example:

\b(inc\.|ltd\.|corp\.|corp)\b(?=(?:.*\s+\w+$))

However, in strings such as ABC LTD. BLOCK, SMALL LTD. ASSOCIATION, and BASIC LTD. REGULAR NAME, it does not find ltd. If I change the pattern to \b(inc|ltd|corp)\b, it finds ltd.

How can I include . when searching in a string?

import re

rgx_list = 'inc\.|ltd\.|corp\.'
regex = r'\b({})\b(?=(?:.*\s+\w+$))'.format(rgx_list)
st = 'ABC LTD. BLOCK'

found = re.findall(regex, st.lower())  # returns [] instead of ['ltd.']

Thanks for your guidance.

John Barton

3 Answers

1

The problem isn't with escaping the period. The problem is the \b you put after it.

\b matches a word boundary: a word character on the left and a non-word character on the right, or vice versa.

But you want to match between ltd. and the space after it. That's not a word boundary, because . and space are both non-word characters.
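The boundary behaviour described above can be checked directly. This is a minimal sketch using the lowercased sample string from the question; the variable names are only illustrative.

```python
import re

s = 'abc ltd. block'

# \b after the period fails: '.' and ' ' are both non-word characters.
with_trailing_b = re.findall(r'\bltd\.\b', s)

# Without the trailing \b, the period is matched.
without_trailing_b = re.findall(r'\bltd\.', s)

# \b right after 'ltd' succeeds: 'd' is a word character and '.' is not.
boundary_before_dot = re.findall(r'\bltd\b', s)

print(with_trailing_b, without_trailing_b, boundary_before_dot)
```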

If you remove the \b that follows the group, the regex will work, although you might get other undesired matches. This is not easy to solve with regular expressions, since their concept of a "word" is not as general as in natural language processing.

regex = r'\b({})(?=(?:.*\s+\w+$))'.format(rgx_list)
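One possible way to keep a boundary-like check after the period, offered as a sketch rather than as part of this answer's method, is to replace the trailing \b with the negative lookahead (?!\S), which requires whitespace or the end of the string. The helper name find_suffixes is hypothetical, and whether this removes all undesired matches depends on the data.

```python
import re

# Suffix list taken from the question.
rgx_list = r'inc\.|ltd\.|corp\.'

# The leading \b still works: it sits between a space and a letter.
# (?!\S) stands in for the trailing \b: the next character must be
# whitespace, or the string must end.
pattern = re.compile(r'\b({})(?!\S)(?=.*\s+\w+$)'.format(rgx_list))

def find_suffixes(name):
    """Hypothetical helper: return the suffixes found in a company name."""
    return pattern.findall(name.lower())

print(find_suffixes('ABC LTD. BLOCK'))    # suffix followed by a space
print(find_suffixes('ABC LTD.X BLOCK'))   # period glued to another token
```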
Barmar
  • Which regex approach would you recommend for this scenario? I already tried it without the word boundary, but as you said, I am getting undesired matches. – John Barton Apr 06 '20 at 19:48
  • Regular expressions aren't the solution for everything. A natural language toolkit library might be more appropriate. – Barmar Apr 06 '20 at 20:08
0

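\. is not a recognized escape sequence in a normal Python string, so Python keeps the backslash and rgx_list happens to work, but relying on this is fragile and recent Python versions warn about invalid escape sequences. To escape the characters properly inside the regex, either make rgx_list a raw string or escape the backslashes: rgx_list = 'inc\\.|ltd\\.|corp\\.'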
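As a quick check of the two spellings recommended here, the sketch below compares a raw string with a doubled-backslash string, and shows what an unescaped . does inside a pattern. The variable names are illustrative only.

```python
import re

raw = r'inc\.|ltd\.|corp\.'
escaped = 'inc\\.|ltd\\.|corp\\.'

# Both spellings produce the same pattern text.
same = (raw == escaped)

# An unescaped . matches any character, so 'ltd.' also matches 'ltdx'.
unescaped_hit = re.search('ltd.', 'ltdx') is not None

# The escaped version only matches a literal period.
escaped_hit = re.search(raw, 'ltdx') is not None

print(same, unescaped_hit, escaped_hit)
```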

Robin Zigmond
0

Regular expressions are awesome. However, every language implements them differently, and when the syntax is so strict, the slightest difference can get you into trouble.

I highly recommend regex101; it handles most of these issues and it's my go-to source.

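Having said that, in Python 3 the . in the re module does not match a newline by default, so a pattern built on .* is applied one line at a time and you don't need to specify ^ and $ here. (Strictly speaking, this is not the re.MULTILINE flag, which only changes where ^ and $ match.) Given the context this could change.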

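import re

word_list = "ABC LTD. BLOCK\nSMALL LTD. ASSOCIATION\nBASIC LTD. REGULAR NAME"

# (?:ltd|LTD) is an alternation; [ltd|LTD] would be a character class matching a single character
pattern = r".*(?:ltd|LTD)\.(?=\s+\w+)"

for found in re.findall(pattern, word_list):
    print(found)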

Output

ABC LTD.
SMALL LTD.
BASIC LTD.
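To see why a multi-line string yields one match per line, here is a small sketch of how . and $ treat newlines with and without re.MULTILINE. It assumes a two-line sample built from the names in the question; the variable names are illustrative.

```python
import re

text = "ABC LTD. BLOCK\nSMALL LTD. ASSOCIATION"

# Without flags, $ matches only at the end of the whole string.
default_ends = re.findall(r"\w+$", text)

# With re.MULTILINE, $ also matches just before each newline.
multiline_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)

# . does not match a newline unless re.DOTALL is set, so .+ stays on one line.
lines = re.findall(r".+", text)

print(default_ends, multiline_ends, lines)
```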

Note

In your lookahead you specify .* (any run of characters), followed by \s+ (one or more whitespace characters) and \w+ (one or more word characters).

A regex engine compares text against a pattern. The simpler the pattern, the better: searching is faster and uses fewer CPU cycles.

Instead of (?=.*\s+\w+$), why not (?=\s+\w+)?
For example:

    r".*(?:ltd|LTD)\.(?=\s+\w+)"

This will not include the word after ltd. in the match, which is what you intend, yes?
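Note that the two lookaheads are not strictly interchangeable. The sketch below, using made-up sample strings, shows one input where each of them differs: the shorter form needs whitespace and a word immediately after the period, while the original form only needs the string to end in whitespace followed by a word.

```python
import re

# Lookahead from the question: the string must end with whitespace + a word.
to_end = re.compile(r"ltd\.(?=.*\s+\w+$)")

# Shorter lookahead: whitespace + a word must follow the period directly.
adjacent = re.compile(r"ltd\.(?=\s+\w+)")

# Hypothetical inputs chosen to separate the two behaviours.
comma_case = "abc ltd., block"   # a comma follows the period
bang_case = "abc ltd. block!"    # the string does not end in a word character

results = {
    "comma_to_end": bool(to_end.search(comma_case)),
    "comma_adjacent": bool(adjacent.search(comma_case)),
    "bang_to_end": bool(to_end.search(bang_case)),
    "bang_adjacent": bool(adjacent.search(bang_case)),
}
print(results)
```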