Capture all numbers up to three digits

Question

I have the following string:

1 2 134 2009

And I'd like to capture the strings with between 1-3 digits, so the result should be:

['1', '2', '134']

What I have now captures those, but also captures the "first 3" digits in strings that contain more than 3 digits. This is the current regex I have:

>>> re.findall(r'\d{1,3}', '1 2 134 2009')
['1', '2', '134', '200', '9']

# or a bit closer --

>>> re.findall(r'\d{1,3}(?!\d)', '1 2 134 2009')
['1', '2', '134', '009']

What would be the correct way to make sure that another digit doesn't immediately precede it?

@Thefourthbird I suppose that it would be a 'self-contained number', for example if someone looked at the above string they could see that 4 numbers were contained in it. Not sure if I can give a more rigorous explanation. — , Nov 07 '18 at 19:39
hmm... the dupe targets imply that this is a regex question. I still think it's not best solved with regex. — timgeb, Nov 08 '18 at 12:17
This shouldn't have been closed, it's distinct. Voted to reopen. — smci, Nov 19 '18 at 06:47
@timgeb: only if you assume all the characters are either digits or spaces. If you actually have to check/match that, then regex is the simplest solution. — smci, Nov 19 '18 at 06:47
@timgeb: yes you should because the question doesn't specify *"Capture all numbers up to three digits on input which is only whitespace or digits"*. So yes, answer the question as broadly as asked. Your answer will break if there's even one punctuation character. Too brittle. — smci, Nov 19 '18 at 07:03
@smci I disagree. The post tightly and unambiguously constrains the format of the string. In any case, we have a regex answer and a non-regex answer, so everything is fine. — timgeb, Nov 19 '18 at 07:06
@timgeb: No, the question body doesn't constrain anything. It merely happens to show one example where the existing regex captured digit strings (but they could contain other things, like leading or trailing currency symbols, '$' or punctuation). Then you suggest post-processing those by assuming they must be digits. In any case we would not run two regexes: first a `\d{1,3}(?!\d)` then (the proper one) `\b\d{1,3}\b`. We would simply only use the second regex. Secondly it's inefficient to use both a regex and string methods. Third you're claiming the question title disagrees with the body IYO. — smci, Nov 19 '18 at 07:29

Dani Mesejo · Accepted Answer · 2018-11-07T19:45:14.053

Add word boundaries:

import re

result = re.findall(r'\b\d{1,3}\b', '1 2 134 2009')

print(result)

Output

['1', '2', '134']

From the documentation for \b:

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.

By default Unicode alphanumerics are the ones used in Unicode patterns, but this can be changed by using the ASCII flag. Word boundaries are determined by the current locale if the LOCALE flag is used. Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.
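To make the quoted boundary rule concrete, here is a small sketch comparing `\b` with explicit digit lookarounds. The extra sample strings and the names `WORD_BOUNDED` and `DIGIT_BOUNDED` are illustrative assumptions, not part of the original answer.

```python
import re

# \b requires a \w/\W transition, so letters next to the digits block a match.
WORD_BOUNDED = re.compile(r'\b\d{1,3}\b')
# The lookarounds only forbid an adjacent digit, so letters are allowed.
DIGIT_BOUNDED = re.compile(r'(?<!\d)\d{1,3}(?!\d)')

samples = ['1 2 134 2009', 'abc123 x45', '$12 7,000']
for s in samples:
    print(repr(s), WORD_BOUNDED.findall(s), DIGIT_BOUNDED.findall(s))
# '1 2 134 2009' ['1', '2', '134'] ['1', '2', '134']
# 'abc123 x45' [] ['123', '45']
# '$12 7,000' ['12', '7', '000'] ['12', '7', '000']
```

Both patterns agree on the question's input. They differ only when letters or underscores touch the digits, and both treat `7,000` as two separate runs, so which one fits depends on what counts as a "self-contained number" in your data.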

score 13 · Answer 2 · answered Nov 07 '18 at 19:40

If there are only digits separated by whitespace in your string, using re is overkill. You can simply split the string and check the length of the substrings.

>>> numbers = '1 2 134 2009'
>>> [n for n in numbers.split() if len(n) <= 3]
['1', '2', '134']
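If the input might contain tokens that are not pure digits, the length check alone lets them through. A minimal sketch of a guarded variant follows; the helper name `short_numbers` and the second sample string are hypothetical, and `str.isdigit` is one possible check rather than the only one.

```python
def short_numbers(text, max_len=3):
    """Return whitespace-separated tokens that are all digits and at most max_len long."""
    return [tok for tok in text.split() if tok.isdigit() and len(tok) <= max_len]

print(short_numbers('1 2 134 2009'))  # ['1', '2', '134']

# Length check alone keeps non-numeric tokens; the isdigit guard drops them.
messy = '1 $2 ab 2009'
print([n for n in messy.split() if len(n) <= 3])  # ['1', '$2', 'ab']
print(short_numbers(messy))                       # ['1']
```

Note that this rejects `$2` entirely, whereas the `\b` regex would extract `2` from it, so the two approaches are only interchangeable on digits-and-whitespace input.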

timgeb


*In general*, Python `str` operations outperform regex, so I think this solution should be preferred if the formatting constraints are as per the example data. – jpp Nov 19 '18 at 09:22
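One way to check that claim on your own machine is a quick `timeit` comparison. This is only a sketch: the function names and the call count are arbitrary choices, and results will vary with input size and Python version.

```python
import re
import timeit

numbers = '1 2 134 2009'
pattern = re.compile(r'\b\d{1,3}\b')  # precompiled so compilation is not timed

def with_regex():
    return pattern.findall(numbers)

def with_split():
    return [n for n in numbers.split() if len(n) <= 3]

if __name__ == '__main__':
    for fn in (with_regex, with_split):
        seconds = timeit.timeit(fn, number=100_000)
        print(f'{fn.__name__}: {seconds:.3f}s for 100,000 calls')
```

Both functions return the same list for the example string, so the comparison is like for like on that input.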
