Consider this Python regex for finding phone numbers:

reg = re.compile(r".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", re.S)

The problem is that this also matches inside any run of 10 or more digits, so I need to ensure that if a character precedes the phone number, that character is not a digit.

This won't work because it fails when the phone number is at the beginning of the string:

reg = re.compile(r".*?\D(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", re.S)

This won't work either: because [\D]? is optional, the text consumed by the preceding .*? might still end in a digit:

reg = re.compile(r".*?[\D]?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", re.S)

What does work?

EDIT:

Martijn's regex fails with match() even though it works with search():

>>> text = 'The Black Cat Cafe is located at 45 Main Street, Irvington NY 10533, in one of the \nRiver Towns of Westchester. ..... Our unique menu includes baked ziti pizza, \nchicken marsala pizza, margherita pizza and many more choices. ..... 914-232-2800 ...... cuisine, is located at 36 Main Street, New Paltz, NY 12561 in Ulster \nCounty.'
>>> reg = re.compile(r"(?<!\d)(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4})(?!\d)", re.S)
>>> reg.search(text).groups()[0]
'914-232-2800'
>>> reg.match(text) is None
True
>>> reg_dotan = re.compile(r".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", re.S)
>>> reg_dotan.search(text).groups()[0]
'914-232-2800'
>>> reg_dotan.match(text) is None
False

In the application, I'm running the regex in a list comprehension:

have_phones = [d for d in descriptions if reg.match(d)]
dotancohen

1 Answer

Use a negative lookbehind assertion:

reg = re.compile(r"(?<!\d)(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4})(?!\d)", re.S)

I've included a negative lookahead at the end as well. Negative lookbehind and lookahead assertions match only at a position where the text immediately before (lookbehind) or immediately after (lookahead) that position does not match the given pattern.

This is like the ^ and $ anchors, in that they too match specific positions, not characters themselves. In the text 'a1b2c', the start of the string and the positions after a, b and c all match the (?<!\d) negative lookbehind, because at each of those positions the preceding character is not a digit (at the start of the string there is no preceding character at all).
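
As a rough illustration of which positions the lookbehind accepts, the following sketch lists every position in 'a1b2c' where (?<!\d) matches on its own. The variable names are only illustrative:

```python
import re

text = 'a1b2c'

# A pattern made of only the lookbehind consumes no characters;
# it succeeds at each position not preceded by a digit.
lookbehind = re.compile(r'(?<!\d)')

positions = [m.start() for m in lookbehind.finditer(text)]
print(positions)  # [0, 1, 3, 5]
```

Positions 2 and 4 are missing because they come directly after the digits 1 and 2.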

Using these makes your pattern match only if there is no digit right before it and no digit right after it; the start and end of the string qualify as well.

Quick demo:

>>> reg.search('0123456789')          # 10 digits
<_sre.SRE_Match object at 0x1026ea468>
>>> reg.search('10123456789') is None # 11 digits
True
Martijn Pieters
  • You should really add some sample input and expected output to your question. – Martijn Pieters Dec 10 '13 at 11:16
  • @dotancohen: That is entirely logical; `reg.match` limits matches to the start of the string. That is the defining difference between the functions. The pattern I describe should be used with `re.search`, unless you can only match at the start of strings. – Martijn Pieters Dec 10 '13 at 11:34
  • @dotancohen: see [What is the difference between Python's re.search and re.match?](http://stackoverflow.com/q/180986) – Martijn Pieters Dec 10 '13 at 11:35
  • Thanks, I'm going over the fine docs again now. However, see that the regex that I posted in the OP _does_ work, I've added it to the output in the question. – dotancohen Dec 10 '13 at 11:39
  • Thank you Martijn. I was able to massage lookbehind / lookahead into the regex. However, I'm still unsure why the `match()` function does not return `None` using my original regex. See edited OP. Thanks. – dotancohen Dec 11 '13 at 09:00
  • Because you start your pattern with `.*?`, matching everything up to the first digit. – Martijn Pieters Dec 11 '13 at 09:32
  • Wow, I cannot believe that was staring me in the face! Thank you so much Martijn. You are patient and an excellent teacher! – dotancohen Dec 11 '13 at 09:57
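
To summarize the comment thread in code, here is a minimal sketch of how the two patterns behave with search() and match() in the list comprehension from the question. The contents of `descriptions` are made-up sample strings, assumed only for illustration:

```python
import re

# Lookaround pattern from the answer: no digit directly before or after.
reg = re.compile(r"(?<!\d)(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4})(?!\d)", re.S)
# Original pattern from the question, with its leading .*?
reg_dotan = re.compile(r".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", re.S)

# Hypothetical sample data, not taken from the question.
descriptions = [
    "Call us at 914-232-2800 for reservations.",
    "Located at 45 Main Street, Irvington NY 10533.",
    "(914) 232-2800",
    "Order number 10123456789",
]

# search() scans the whole string, so it finds numbers anywhere in the text.
have_phones = [d for d in descriptions if reg.search(d)]

# match() only tries position 0, so it finds only strings that start with a number.
starts_with_phone = [d for d in descriptions if reg.match(d)]

# The original pattern succeeds with match() because its leading .*? consumes
# everything before the number; it also accepts the 11-digit run.
dotan_matches = [d for d in descriptions if reg_dotan.match(d)]

print(have_phones)
print(starts_with_phone)
print(dotan_matches)
```

With this sample data, search() keeps the first and third strings, match() keeps only the third, and the original pattern additionally accepts the 11-digit order number, which is the false positive the question set out to avoid.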