
I have the following case:

Test (2.00001) Test (2.000) Test 2.1 Test (2,0001) Test 2,000 Test 2,1000 test 2

I am trying to use a regex to find only the integer values:

  1. 2.000
  2. 2,000
  3. 2

but not the other float numbers.
I tried different things:

re.search(r'(?<![0-9.])2(?![.,]?[1-9])(?=[.,]*[0]*)(?![1-9])', ...)

but this returns true for:

  1. 2.00001
  2. 2.000
  3. 2,000
  4. 2,0001
  5. 2
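
To make this reproducible, here is a minimal sketch that runs the attempted pattern against each number on its own. The `tokens` list and the loop are illustrative assumptions; only the pattern comes from the attempt above.

```python
import re

# Pattern from the attempt above, written as a raw string
pattern = r'(?<![0-9.])2(?![.,]?[1-9])(?=[.,]*[0]*)(?![1-9])'

# Hypothetical test tokens taken from the sample sentence
tokens = ['2.00001', '2.000', '2.1', '2,0001', '2,000', '2,1000', '2']

# Keep the tokens for which re.search finds a match
matched = [t for t in tokens if re.search(pattern, t)]
print(matched)  # ['2.00001', '2.000', '2,0001', '2,000', '2']
```

The lookaheads only inspect the one or two characters right after the `2`, and `(?=[.,]*[0]*)` can match the empty string, so a non-zero digit further along (as in `2.00001`) is never checked.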

What do I have to do?

UPDATE
I have updated the question: it should also find an integer written without any comma or point (2).

Code Pope
  • Try `(?<!\d)(?<!\d[.,])\d{1,3}(?:[.,]\d{3})*(?![,.]?\d)`, see [demo](https://regex101.com/r/qrG8hg/1). – Wiktor Stribiżew Nov 10 '22 at 14:25
  • @WiktorStribiżew it does not match all integers in `test 2 2.00` – Code Pope Nov 10 '22 at 14:28
  • If you do NOT want to match `2.00001`, why do you want to match `2.00`? How can you formulate the pattern requirements regarding differentiation between valid and non-valid floats? – Wiktor Stribiżew Nov 10 '22 at 14:30
  • Mathematically the value of `2.00` is an integer, the value of `2.0001` is not. I am looking if the integer `2` is existing in the string. – Code Pope Nov 10 '22 at 14:36
  • What about `(?<!\d)(?<!\d[.,])(?:\d{1,3}(?:([.,])\d{3})*|\d{4,})(?:(?!\1)[.,]0+)?(?![,.]?\d)`? See https://regex101.com/r/qrG8hg/2 – Wiktor Stribiżew Nov 10 '22 at 14:40
  • If you do not need to support thousand separators: `(?<!\d)(?<!\d[.,])\d+(?:[.,]0+)?(?![,.]?\d)` - see [this demo](https://regex101.com/r/qrG8hg/3). – Wiktor Stribiżew Nov 10 '22 at 14:45

4 Answers

1

I would use:

import re

text = 'Test (2.00001) Test (2.000) Test 2.1 Test (2,0001) Test 2,000 Test 2,1000'

re.findall(r'(\d+[.,]0+)(?!\d)', text)

Output:

['2.000', '2,000']

Regex:

(        # start capturing
\d+      # match digit(s)
[.,]     # match . or ,
0+       # match one or more zeros
)        # stop capturing
(?!\d)   # ensure the last zero is not followed by a digit

regex demo

If you also want to match standalone integers, surrounded by spaces or parentheses/brackets:

import re

text = 'Test (2.00001) Test (2.000) Test 2.1 Test (2,0001) Test 2,000 Test 2,1000 2'

re.findall(r'(?:^|[(\s[])(\d+(?:[.,]0+(?!\d))?)(?=[]\s)]|$)', text)

Regex:

(?:^|[(\s[])      # match the start of string or [ or ( or space
(                 # start capturing
\d+               # match digit(s)
(?:[.,]0+(?!\d))? # optionally match . or , with only zeros
)                 # stop capturing
(?=[]\s)]|$)      # match the end of string or ] or ) or space

regex demo
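
As a quick check of the second pattern, the sketch below runs it on the sample text and prints the captured numbers. The variable names `pattern` and `result` are illustrative only.

```python
import re

text = 'Test (2.00001) Test (2.000) Test 2.1 Test (2,0001) Test 2,000 Test 2,1000 2'

# Leading boundary: start of string, "(", "[" or whitespace
# Trailing boundary: end of string, ")", "]" or whitespace
pattern = r'(?:^|[(\s[])(\d+(?:[.,]0+(?!\d))?)(?=[]\s)]|$)'

# findall returns the content of the single capturing group
result = re.findall(pattern, text)
print(result)  # ['2.000', '2,000', '2']
```

Because the boundaries are listed explicitly, a number followed by any other character (a trailing period or semicolon, for instance) is not matched unless that character is added to the trailing class.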

mozway
  • Ok, I have to update my question. It should of course also match `2` alone without `,` and `.`. – Code Pope Nov 10 '22 at 13:15
  • @CodePope OK, but how do you define a number? can you have cases like `abc123` or `127.0.0.0`? – mozway Nov 10 '22 at 13:38
  • No, such cases are not possible. Numbers appear either in brackets or have at least whitespace prior to them. – Code Pope Nov 10 '22 at 13:44
  • But it does not match all integers when the string is `2 2.00 `. – Code Pope Nov 10 '22 at 14:06
  • You said there is "*at least whitespace prior to them*", I didn't include a start of string boundary, let me update ;) – mozway Nov 10 '22 at 14:07
  • Thanks. Could you add some descriptions like your first regular expression. It would be extremely helpful. – Code Pope Nov 10 '22 at 14:26
  • @CodePope have you seen the [regex demo](https://regex101.com/r/1hNnHj/1) link? This provides a detailed explanation of all the steps. But I'll try to add something – mozway Nov 10 '22 at 14:30
  • It is not good to use a negative lookahead after a quantified pattern like `0+(?!\d)`. If there are `0001`, it will just match the first two zeros. It won't fail the match. – Wiktor Stribiżew Nov 10 '22 at 14:43
  • @Wiktor thanks for the feedback, how would you suggest to improve it? using the negative lookahead **before** the quantifier? `(?:0(?!\d))+`? – mozway Nov 10 '22 at 14:45
  • @Wiktor I certainly trust your regex expertise, but I don't understand how my approach would not work, on the example you provided as comment, my regex works and also uses less steps. I'd appreciate a lot a counter-example if you have one. – mozway Nov 10 '22 at 14:49
  • See your `re.findall(r'(\d+[.,]0+)(?!\d)', text)` solution and then see [*Negative lookahead not working after character range with plus quantifier*](https://stackoverflow.com/a/65760343/3832970). The second solution will need boundary adaptation each time a new boundary char appears. – Wiktor Stribiżew Nov 10 '22 at 15:03
1

You can use

re.findall(r'\b(?<!\d[.,])\d+(?:[.,]0+)?\b(?![,.]\d)', text)

See the regex demo. Details:

  • \b - a word boundary
  • (?<!\d[.,]) - no digit followed with . or , immediately on the left
  • \d+ - one or more digits
  • (?:[.,]0+)? - an optional sequence of . or , and then one or more zeros
  • \b - a word boundary
  • (?![,.]\d) - no , or . and a digit allowed immediately to the right.
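
A minimal runnable sketch of the pattern described above, applied to the sample sentence from the question (the `text`, `pattern` and `result` names are illustrative):

```python
import re

text = 'Test (2.00001) Test (2.000) Test 2.1 Test (2,0001) Test 2,000 Test 2,1000 test 2'

# Digits, optionally followed by a separator and only zeros,
# not preceded or followed by a separator plus digit
pattern = r'\b(?<!\d[.,])\d+(?:[.,]0+)?\b(?![,.]\d)'

# No capturing groups, so findall returns the full matches
result = re.findall(pattern, text)
print(result)  # ['2.000', '2,000', '2']
```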

If you need to support thousand separators:

pattern = r'\b(?<!\d[.,])(?:\d{1,3}(?:(?=([.,]))(?:\1\d{3})+)?|\d{4,})(?:(?!\1)[.,]0+)?\b(?![,.]\d)'
matches = [x.group() for x in re.finditer(pattern, text)]

See this regex demo.

Wiktor Stribiżew
0

You can also avoid regex entirely: convert each token to a number and call is_integer() on it. It is a little harder to read, but it should be robust for further use cases given the string structure you provide (here `pd` is pandas and `string` holds the input text):

[x for x in string.split() if float((pd.to_numeric(x.replace(r'(','').replace(r')','').replace(r',','.'),errors='coerce'))).is_integer()]

This returns the original tokens:

['(2.000)', '2,000', '2']

Or if you'd like them cleaned:

[x for x in string.replace(r'(','').replace(r')','').replace(r',','.').split() if float((pd.to_numeric(x,errors='coerce'))).is_integer()]

Returning:

['2.000', '2.000', '2']
Celius Stingher
0

This should be easy: just extract each number and check whether its value is an integer. Maybe something like this:

import re

text = 'Test (2.00001) Test (2.000) Test 2.1 Test (2,0001) Test 2,000 Test 2,1000 test 2'
out_ints = []
# Grab every run of digits, dots and commas as a candidate number
for x in re.findall(r'[0-9.,]+', text):
    # Treat a comma as a decimal separator
    value = float(x.replace(',', '.'))
    if value.is_integer():
        out_ints.append(int(value))

print(out_ints)

Output:

[2, 2, 2]
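
One caveat: the character class also picks up stray punctuation (a lone `,` or `.`) and tokens such as `1.2.3`, which `float()` rejects with a `ValueError`. A hedged sketch of a more defensive variant follows; the function name `find_integer_values` is hypothetical and not part of the code above.

```python
import re

def find_integer_values(text):
    """Return the integer values of all integer-valued numbers in text."""
    found = []
    # Candidate tokens: runs of digits, dots and commas
    for token in re.findall(r'[0-9.,]+', text):
        try:
            # Assumption: a comma is a decimal separator, as in the loop above
            value = float(token.replace(',', '.'))
        except ValueError:
            # Skip tokens that are not valid numbers, e.g. "," or "1.2.3"
            continue
        if value.is_integer():
            found.append(int(value))
    return found

sample = 'Test (2.00001) Test (2.000) Test 2.1 Test (2,0001) Test 2,000 Test 2,1000 test 2'
print(find_integer_values(sample))             # [2, 2, 2]
print(find_integer_values('a, b. 3,0 1.2.3'))  # [3]
```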

Or am I missing something?

RobertG