I have a regex: r'((\+91|0)?\s?\d{10})'

I'm trying to match numbers like +91 1234567890, 1234567790, 01234567890.

A number like 1234568901112 should not be matched, because it is neither exactly 10 digits nor 10 digits preceded by +91 or 0.

When I try to use re.findall():

re.findall(r'((\+91|0)?\s?\d{10})', '+91 1234567890, 1234567790, 01234567890, 1234568901112')
[('+91 1234567890', '+91'),
 (' 1234567790', ''),
 (' 0123456789', ''),
 (' 1234568901', '')]

Notice that the third and fourth results are not what I want. For the third, I expect 01234567890, because it starts with 0 followed by 10 digits, but only the first 10 characters are returned. I also don't want the fourth result at all, because that number doesn't match completely. Either the pattern matches the complete word/string, or the number is invalid.

Is there any other regex that I can use? Or a function? What am I doing wrong here?

The expected output is:

['+91 1234567890', '1234567790', '01234567890']

Please let me know if any more clarifications are needed.

Mohit Motwani

1 Answer

You may use

r'(?<!\w)(?:(?:\+91|0)\s?)?\d{10}\b'

See the regex demo.

The point is to match these patterns as whole words. The problem is that the first part is optional and one of the optional alternatives starts with a non-word char (+), so a single \b word boundary at the start won't work here.
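To illustrate the point about the leading boundary, here is a minimal sketch comparing a leading \b with the (?<!\w) lookbehind. The sample string and the variable names are only illustrative assumptions. Since \b needs a word char on one side, it cannot match between the start of the string (or a space) and +, so the +91 prefix is dropped from the match:

```python
import re

s = '+91 1234567890, 1234567790, 012345678900, 1234568901112, 01234567890'

# Leading \b: there is no word boundary right before '+',
# so the match can only start at the digits after '+91 '.
with_boundary = re.findall(r'\b(?:(?:\+91|0)\s?)?\d{10}\b', s)
print(with_boundary)
# => ['1234567890', '1234567790', '01234567890']

# Leading (?<!\w): only requires that no word char precedes the match,
# so the optional '+91 ' prefix is kept.
with_lookbehind = re.findall(r'(?<!\w)(?:(?:\+91|0)\s?)?\d{10}\b', s)
print(with_lookbehind)
# => ['+91 1234567890', '1234567790', '01234567890']
```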

Details

  • (?<!\w) - there should be no word char immediately to the left of the current location
  • (?:(?:\+91|0)\s?)? - an optional occurrence of
    • (?:\+91|0) - +91 or 0
    • \s? - an optional whitespace
  • \d{10}\b - ten digits followed by a word boundary, so no word char is allowed immediately to the right; together with (?<!\w), this makes the match a whole word
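The same breakdown can be written into the pattern itself with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is only a sketch: the name PHONE_RE and the sample string are assumptions for illustration, and the pattern is meant to be equivalent to the one-line version above.

```python
import re

# PHONE_RE is a hypothetical name; the pattern mirrors the one-line regex.
PHONE_RE = re.compile(r'''
    (?<!\w)          # no word char immediately to the left
    (?:              # optional prefix (non-capturing)
        (?:\+91|0)   #   either +91 or 0
        \s?          #   then an optional whitespace
    )?
    \d{10}           # exactly ten digits
    \b               # no word char immediately to the right
''', re.VERBOSE)

sample = '+91 1234567890, 1234567790, 012345678900, 1234568901112, 01234567890'
found = PHONE_RE.findall(sample)
print(found)
# => ['+91 1234567890', '1234567790', '01234567890']
```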

Python demo:

import re
s = '+91 1234567890, 1234567790, 012345678900, 1234568901112, 01234567890'
print(re.findall(r'(?<!\w)(?:(?:\+91|0)\s?)?\d{10}\b', s))
# => ['+91 1234567890', '1234567790', '01234567890']
Wiktor Stribiżew
  • Thank you for your response. I've made one small change. There was a typo. I've changed 012345678900 to 01234567890 in the input. This should match too. – Mohit Motwani Mar 19 '19 at 10:23
  • @MohitMotwani I updated the post. I thought about the unambiguous word boundaries (like `(?<!\w)` / `(?!\w)`) from the very start, but your examples put me off-track. – Wiktor Stribiżew Mar 19 '19 at 10:28
  • This works! Thank you so much. `(?<!\w)` is exactly what I was looking for. Could you please explain `?:` more clearly, though? I'm a bit confused. – Mohit Motwani Mar 19 '19 at 10:29
  • @MohitMotwani `(?:...)` is a [non-capturing group](https://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-what-does-do). It is used here to make sure `re.findall` [does not behave weird](https://stackoverflow.com/a/31915134/3832970) – Wiktor Stribiżew Mar 19 '19 at 10:32
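
As a small illustration of the `re.findall` behavior mentioned in the last comment (the sample string and variable names are assumptions): when the pattern contains capturing groups, `re.findall` returns the groups rather than the whole match, while with non-capturing groups it returns the full matched text.

```python
import re

s = '+91 1234567890, 01234567890'

# Two capturing groups: findall returns a tuple of the groups per match,
# not the full matched number.
capturing = re.findall(r'(?<!\w)((\+91|0)\s?)?\d{10}\b', s)
print(capturing)
# => [('+91 ', '+91'), ('0', '0')]

# Non-capturing groups only: findall returns the whole match.
non_capturing = re.findall(r'(?<!\w)(?:(?:\+91|0)\s?)?\d{10}\b', s)
print(non_capturing)
# => ['+91 1234567890', '01234567890']
```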