Regex to match all types of percentage

Question

I have some percentage cases as follows:

I want to match all types of percentages except anything larger than 100. Expected output:

I have tried the pattern from (Regular Expression for Percentage of marks), but it fails to match all the cases that I want. Also, I am replacing the non-matches with an empty string. My Python code looks like this:

pattern=r'(\b(?<!\.)(?!0+(?:\.0+)?%)(?:\d|[1-9]\d|100)(?:(?<!100)\.\d+)?$)'
df['Percent']=df['Percent'].astype(str).str.extract(pattern)[0]

Many thanks.

Edit: The solution (by @rv.kvetch) matches most of the edge cases except the 0 ones, but I can work with that limitation. The original post had the requirement of not matching the 0 or 0% cases.

the majority of the inputs don't have the % suffix, is that intentional? — rv.kvetch, Sep 25 '21 at 00:49
no. That's how they are in the dataset. I only copied a few. — non_linear, Sep 25 '21 at 00:50
btw you can add a `%?` in regex to match an optional % symbol at the end — rv.kvetch, Sep 25 '21 at 00:51

score 1 · Answer 1 · answered Sep 25 '21 at 06:41

If you want, you can do it without using regex.

nums = ['12.02',
'16.59',
'81.61%',
'45',
'24.812',
'51.35',
'19348952',
'88.22',
'0',
'000',
'021',
'.85%',
'100']

for n in nums:
  x = n.rstrip('%')  # drop a trailing percent sign, if any
  x = float(x)  # float rather than int, so '12.02' and '.85' parse
  if x <= 100:
    print(n)

Hirusha Fernando

rv.kvetch · Accepted Answer · 2021-09-27T14:51:31.677

I'm probably very close but looks like this is working for me so far:

^(?:0{0,})((?:[1-9]{1,2}|100)?(?:\.\d+)?)%?$

Regex demo

Description

First non-capturing group

(?:0{0,}) - non-capturing group which matches a leading 0, that appears zero or more times.

First capture group

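(?:[1-9]{1,2}|100)? - Optional, non-capturing group which matches the digits 1-9 one to two times, intended to cover the range 1-99, with an alternative so that 100 is covered as well. Note that [1-9] excludes the digit 0, so values such as 10 or 20 are not matched as written; [1-9]\d? would cover the full range 1-99. This group is made optional by ? to cover cases like .24, which is still a valid percentage.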
(?:\.\d+)? - Optional, non-capturing group which matches the fractional part, e.g. .123. This is optional because numbers like 20 are valid percentage values by themselves.

Trailing percent sign

%? - finally, here we match the optional trailing percent (%) symbol that can come at the end.
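
To see how the pattern behaves on concrete inputs, a small check like the sketch below may help. The helper name extract_percent and the sample list are assumptions made for illustration; only the pattern itself is taken from the answer. An empty first group is treated as "no match", since that is what inputs like 0 or 000 produce.

```python
import re

# The pattern from the answer, unchanged.
PATTERN = re.compile(r'^(?:0{0,})((?:[1-9]{1,2}|100)?(?:\.\d+)?)%?$')


def extract_percent(text):
    """Hypothetical helper: return the captured number as a string, or None."""
    match = PATTERN.match(text)
    # Group 1 is empty for inputs such as '0' or '000', so treat that as no match.
    if match is None or match.group(1) == '':
        return None
    return match.group(1)


samples = ['12.02', '81.61%', '45', '021', '.85%', '100',
           '0', '000', '19348952', '10', '.0%']
results = {s: extract_percent(s) for s in samples}

for sample, captured in results.items():
    print(f'{sample!r:>12} -> {captured!r}')
```

In this sketch '021' yields '21' and '.85%' yields '.85', while '0', '000' and '19348952' yield None. '.0%' is still captured as '.0', and '10' is rejected because [1-9] does not accept a zero digit.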

Update

Here is a non-regex approach that should be more efficient than a regex approach. This also covers edge cases like .0 that the regex currently hasn't been updated to handle:

string = """
12.02
16.59
81.61%
45
24.812
51.35
19348952
88.22
0
000
.0%
021
.85%
100
150
1.2.3
hello world
"""

for n in string.split('\n'):
    try:
        x = float(n.rstrip('%'))
    except ValueError: # invalid numeric value
        continue
    # Check if number is in the range 0..100 (non-inclusive of 0)
    if 0 < x <= 100:
        print(x)
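
The same check could be wrapped in a function so it can be reused on one value at a time. This is only a sketch: the name parse_percent is hypothetical, and returning None for rejected values is an assumption about what is convenient downstream.

```python
def parse_percent(value):
    """Hypothetical helper: return value as a float in (0, 100], else None."""
    try:
        # str() lets this accept non-string inputs; rstrip drops a trailing '%'.
        number = float(str(value).strip().rstrip('%'))
    except ValueError:  # invalid numeric value, e.g. '1.2.3' or 'hello world'
        return None
    # Same range check as above: 0 is excluded, 100 is included.
    return number if 0 < number <= 100 else None


values = ['12.02', '81.61%', '0', '.0%', '150', '1.2.3', '100']
parsed = [parse_percent(v) for v in values]
print(parsed)
```

With a pandas column like the df['Percent'] from the question, something along the lines of df['Percent'].map(parse_percent) should apply it row by row, though that has not been tested against the original dataset.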

edited Sep 27 '21 at 14:51

answered Sep 25 '21 at 01:01

rv.kvetch

It matches the 0 or 0% cases as well. Any way you could exclude those too? – non_linear Sep 25 '21 at 01:04
yep, just updated. looks like it does exclude them now. at least, it's not included in the first group. – rv.kvetch Sep 25 '21 at 01:08
But it does still capture ones like `.0%` - so I might need to fix that – rv.kvetch Sep 25 '21 at 01:10
Updated to add a description with the breakdown of the regex – rv.kvetch Sep 25 '21 at 01:16
I think I could work with this 0 case limitation as this solution still matches most of the edge cases. I will mark this as answer after editing the question. Thanks – non_linear Sep 25 '21 at 01:44
why can't r'\d+\.\d+%' be used to find the percentage data? – Golden Lion Sep 27 '21 at 14:40
that could, very easily. I think a no-regex approach as mentioned by @Hirusha Fernando is likely more efficient also. But I think OP had specific requirements for this - for example, use regex to match only percentages in the range 1-100 (but `\d+\.\d+%` would of course get you values outside this range). It depends, I guess, on whether you wanted to implement the validation outside of regex. – rv.kvetch Sep 27 '21 at 14:46
@GoldenLion I already tried that one before asking here, but as rv.kvetch mentioned, it fails to catch a lot of edge cases. I actually tried to post most of the unique ones, but there are other cases as well that would not be matched. – non_linear Oct 02 '21 at 13:04