14

I have the following string:

1 2 134 2009

And I'd like to capture the strings with between 1-3 digits, so the result should be:

['1', '2', '134']

What I have now captures those, but also captures the "first 3" digits in strings that contain more than 3 digits. This is the current regex I have:

>>> re.findall(r'\d{1,3}', '1 2 134 2009')
['1', '2', '134', '200', '9']

# or a bit closer --

>>> re.findall(r'\d{1,3}(?!\d)', '1 2 134 2009')
['1', '2', '134', '009']

What would be the correct way to make sure that another digit doesn't immediately precede it?

jpp
  • 1
    What is the logic to match `123` in `['1', '2', '123']` – The fourth bird Nov 07 '18 at 19:37
  • @Thefourthbird I suppose that it would be a 'self-contained number', for example if someone looked the above string they could see that 4 numbers were contained in it. Not sure if I can give a more rigorous explanation. –  Nov 07 '18 at 19:39
  • 1
    @Thefourthbird oh I see. Sorry that was a typo -- fixed. –  Nov 07 '18 at 19:41
  • hmm... the dupe targets imply that this is a regex question. I still think it's not best solved with regex. – timgeb Nov 08 '18 at 12:17
  • 2
    This shouldn't have been closed, it's distinct. Voted to reopen. – smci Nov 19 '18 at 06:47
  • @timgeb: only if you assume all the characters are either digits or spaces. If you actually have to check/match that, then regex is the simplest solution. – smci Nov 19 '18 at 06:47
  • @smci should I assume otherwise given the OP's input? – timgeb Nov 19 '18 at 06:57
  • @timgeb: yes you should because the question doesn't specify *"Capture all numbers up to three digits on input which is only whitespace or digits"*. So yes, answer the question as broadly as asked. Your answer will break if there's even one punctuation character. Too brittle. – smci Nov 19 '18 at 07:03
  • @smci I disagree. The post tightly and unambiguously constrains the format of the string. In any case, we have a regex answer and a non-regex answer, so everything is fine. – timgeb Nov 19 '18 at 07:06
  • @timgeb: No, the question body doesn't constrain anything. It merely happens to show one example where the existing regex captured digit strings (but they could contain other things, like leading or trailing currency symbols, '$' or punctuation). Then you suggest post-processing those by assuming they must be digits. In any case we would not run two regexes: first a `\d{1,3}(?!\d)` then (the proper one) `\b\d{1,3}\b`. We would simply only use the second regex. Secondly it's inefficient to use both a regex and string methods. Third you're claiming the question title disagrees with the body IYO. – smci Nov 19 '18 at 07:29

2 Answers

16

Add word boundaries:

import re

result = re.findall(r'\b\d{1,3}\b', '1 2 134 2009')

print(result)

Output

['1', '2', '134']

From the documentation for \b:

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.

By default Unicode alphanumerics are the ones used in Unicode patterns, but this can be changed by using the ASCII flag. Word boundaries are determined by the current locale if the LOCALE flag is used. Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.
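One caveat worth illustrating: because \b is a boundary between \w and \W, digits glued to a letter or an underscore are not matched. If the requirement is only "no digit immediately before or after", a lookbehind/lookahead pair is an alternative. The sketch below compares the two; the function names and the extended sample string are made up for illustration and are not from the question.

```python
import re

def find_with_boundaries(text):
    # \b requires a word/non-word transition, so a run of digits that
    # touches a letter or underscore (e.g. 'a12', 'x_7') is skipped.
    return re.findall(r'\b\d{1,3}\b', text)

def find_with_lookarounds(text):
    # Only forbids a digit directly before or after the match;
    # adjacent letters, underscores, and punctuation are allowed.
    return re.findall(r'(?<!\d)\d{1,3}(?!\d)', text)

sample = '1 2 134 2009 a12 x_7 $45'  # hypothetical input
print(find_with_boundaries(sample))   # ['1', '2', '134', '45']
print(find_with_lookarounds(sample))  # ['1', '2', '134', '12', '7', '45']
```

On the original input '1 2 134 2009' both patterns give the same result, so which one fits depends on whether something like 'a12' should count as a number.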

Dani Mesejo
13

If there are only digits separated by whitespace in your string, using re is overkill. You can simply split the string and check the length of the substrings.

>>> numbers = '1 2 134 2009'
>>> [n for n in numbers.split() if len(n) <= 3]
['1', '2', '134']
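If the input might contain tokens that are not purely digits, the same split-based idea can be made a bit more defensive by also checking each token. This is only a sketch under that assumption; `short_numbers` and `max_len` are hypothetical names, and tokens with attached punctuation are dropped rather than cleaned up.

```python
def short_numbers(text, max_len=3):
    # Keep whitespace-separated tokens that consist only of ASCII digits
    # and are at most max_len characters long.
    return [
        tok for tok in text.split()
        if tok.isascii() and tok.isdigit() and len(tok) <= max_len
    ]

print(short_numbers('1 2 134 2009'))    # ['1', '2', '134']
print(short_numbers('1 $2 134, 2009'))  # ['1']
```

Note that '$2' and '134,' are rejected entirely here, whereas a regex would extract the digits from them, so this only suits input where numbers stand alone.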
timgeb
  • 1
    *In general*, Python `str` operations outperform regex, so I think this solution should be preferred if the formatting constraints are as per the example data. – jpp Nov 19 '18 at 09:22