0

I am working on a text analytics problem and I am trying to remove all the numbers between 0 and 999 with a regular expression in Python. I tried the Regex Numeric Range Generator to get the regular expression, but I didn't succeed: I can only remove all the numbers.

I have tried several regexes, but none of them worked. Here's what I tried:

# Remove numbers starting from 0 ==> 999
data_to_clean = re.sub('[^[0-9]{1,3}$]', ' ', data_to_clean)

I have tried this also:

# Remove numbers starting from 0 ==> 999
data_to_clean = re.sub('\b([0-9]|[1-8][0-9]|9[0-9]|[1-8][0-9]{2}|9[0-8][0-9]|99[0-9])\b', ' ', data_to_clean)  

this one:

^([0-9]|[1-8][0-9]|9[0-9]|[1-8][0-9]{2}|9[0-8][0-9]|99[0-9])$

and this:

def clean_data(data_to_clean):
    # Remove numbers starting from 0 ==> 999
    data_to_clean = re.sub('[^[0-9]{1,3}$]', ' ', data_to_clean)  
    return data_to_clean

I have a lot of numbers, but I need to delete only the ones with at most three digits and keep the others.

Thank You for your help

Yam Mesicka
kawai
  • shouldn't this combination from your tries work: `\b[0-9]{1,3}\b`? If you check: https://regex101.com/r/qDrobh/6 it should work – gaw Feb 12 '19 at 14:13
  • could you post an example text, where the numbers should be replaced? – gaw Feb 12 '19 at 14:19
  • **bonjour la commande 2000501784 est validée et pour autant je ne peux la réceptionner poste 30 merci d avance ** I am getting the same result – kawai Feb 12 '19 at 14:19
  • I should delete 30 – kawai Feb 12 '19 at 14:19

3 Answers

1

You need to precede the pattern string with an r (making it a raw string) so that the interpreter won't turn \b into a backspace character before the regex engine sees it. You can also simplify the pattern like this:

data_to_clean = re.sub(r'\b([0-9]|[1-9][0-9]{1,2})\b', ' ', data_to_clean)
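
To make the difference visible, here is a small sketch comparing the same pattern with and without the r prefix. The sample string is a shortened version of the text the asker posted in the comments, and the variable names are only illustrative.

```python
import re

# Shortened version of the asker's sample text (an assumption for illustration).
text = "la commande 2000501784 poste 30 merci d avance"

# Without the r prefix, Python converts "\b" to a backspace character (\x08)
# before the regex engine sees it, so no word boundary is ever tested.
plain = re.sub('\b([0-9]|[1-9][0-9]{1,2})\b', ' ', text)

# With the r prefix, the regex engine receives \b as a word-boundary assertion.
raw = re.sub(r'\b([0-9]|[1-9][0-9]{1,2})\b', ' ', text)

print(plain == text)  # True: nothing was replaced
print(raw)            # "30" is replaced by a space, 2000501784 is kept
```

Note that replacing with ' ' leaves extra spaces where the number used to be.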
krisz
  • Not sure if it is required, but numbers with leading zeros are not matched (e.g. 000, 001, ...). I think that was intentional, but it's worth mentioning – gaw Feb 12 '19 at 14:55
  • I assumed leading zeros should not be included since op tried to use a Regex Numeric Range Generator – krisz Feb 12 '19 at 14:59
  • I like your answer much more than JGNI's, since a lookahead is really expensive and not necessary here. It also shows that his answer needs >400 steps compared to ~200 steps for your answer. – gaw Feb 12 '19 at 15:24
0

I think you can use a combination of your try with word boundaries (\b) and your last try ([0-9]{1,3}).

So the resulting regex should look like: \b[0-9]{1,3}\b

If you check the demo at regex101.com/r/qDrobh/6, it should replace all 1-digit, 2-digit and 3-digit numbers and ignore longer numbers and other words.
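
A minimal sketch of this pattern in Python, assuming a raw string is used so that \b reaches the regex engine intact. The sample string is made up for illustration; it also shows that this pattern removes numbers with leading zeros such as 007. The whitespace cleanup at the end is an optional extra, not part of the answer.

```python
import re

# Raw string, so \b is a word boundary and not a backspace.
pattern = re.compile(r'\b[0-9]{1,3}\b')

# Hypothetical sample text with short, zero-padded and long numbers.
text = "commande 2000501784 poste 30 ref 007 total 1000"

cleaned = pattern.sub(' ', text)
print(cleaned)  # 30 and 007 are replaced, 2000501784 and 1000 are kept

# Optional: collapse the runs of spaces left behind by the replacement.
tidy = " ".join(cleaned.split())
print(tidy)
```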

gaw
  • it doesn't remove the numbers even with those regular expressions. – kawai Feb 12 '19 at 14:29
  • You are welcome, but don't forget to mark the correct answer and/or vote for helpful comments. The answer from @krisz looks very good to me, and even takes leading zeros into account. But it will not remove 000, for example – gaw Feb 12 '19 at 14:54
  • Ok @gaw. Thank u for ur help – kawai Feb 12 '19 at 15:12
0

Numbers from 0 to 999 are

  1. A single character [0-9]
  2. Two characters [1-9][0-9]
  3. Three characters [1-9][0-9][0-9]

This gives a naive regex of /\b(?:[0-9]|[1-9][0-9]|[1-9][0-9][0-9])\b/. However, the character classes are duplicated across the alternatives, so we can factor them out:

/(?!\b0[0-9])\b[0-9]{1,3}\b/

This works by using a negative lookahead (?!\b0[0-9]) to check for the start of a word followed by a 0 followed by a digit, which rejects 01 etc., and then looks for one to three characters in the range 0-9. Because the negative lookahead needs at least two characters, a single 0 still passes as valid.
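
The patterns above are written with /.../ delimiters; in Python the slashes are dropped and the pattern goes in a raw string. As a rough check, the sketch below compares the lookahead form with the naive alternation on a few hand-picked tokens. The sample values and variable names are assumptions chosen for illustration.

```python
import re

# Lookahead form: reject a word that starts with 0 followed by another digit.
lookahead = re.compile(r'(?!\b0[0-9])\b[0-9]{1,3}\b')

# Naive alternation form, spelled out per number of digits.
alternation = re.compile(r'\b(?:[0-9]|[1-9][0-9]|[1-9][0-9][0-9])\b')

# Hand-picked tokens: valid 0-999 values, leading zeros, and longer numbers.
samples = ["0", "7", "30", "999", "01", "007", "1000", "2000501784"]

for s in samples:
    a = bool(lookahead.fullmatch(s))
    b = bool(alternation.fullmatch(s))
    print(f"{s!r}: lookahead={a} alternation={b}")

# On running text, both patterns find the same numbers.
text = "poste 30 ref 007 a 0 b 1000"
found_lookahead = lookahead.findall(text)
found_alternation = alternation.findall(text)
print(found_lookahead, found_alternation)
```

Both forms agree on these samples: 0, 7, 30 and 999 match, while 01, 007, 1000 and 2000501784 do not.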

JGNI