2

What is the best way to remove words in a string that start with numbers and contain periods in Python?

this_string = 'lorum3 ipsum 15.2.3.9.7 bar foo 1. v more text 46 2. here and even more text here v7.8.989'

If I use Regex:

re.sub(r'[0-9]*\.\w*', '', this_string)

The result will be:

'lorum3 ipsum  bar foo  v more text 46  here and even more text here v'

I expect the word v7.8.989 not to be removed, since it starts with a letter.

It would be great if removing the words did not leave extra spaces behind. My regex above still leaves them.

Gedanggoreng
  • What's the point of the `s?` in your regex? – Nick Oct 10 '22 at 05:57
  • Isn't `s?` used to match a single whitespace character? – Gedanggoreng Oct 10 '22 at 06:00
  • that would be `\s` – Nick Oct 10 '22 at 06:00
  • What do you want to do with something like `1.2.3c`? – Nick Oct 10 '22 at 06:04
  • If I use `\s`, it will only remove the first 2 parts of the number. But if I use `s?`, it will remove all the words that contain numbers and periods. – Gedanggoreng Oct 10 '22 at 06:05
  • `s?` means 0 or 1 `s` characters, so it's effectively doing nothing – Nick Oct 10 '22 at 06:05
  • Ah, I see. I will edit the question. – Gedanggoreng Oct 10 '22 at 06:06
  • `1.2.3c` should not be removed either. Only the words made up of numbers and periods should be removed. Everything else should stay in the string. – Gedanggoreng Oct 10 '22 at 06:12
  • Use a *word boundary*, `\b`, in the regex, to indicate "at this point in the matching, we must be at either the beginning or end of a word". By putting `\b` before and after some chunk of the regex, we can match a word that matches that chunk. See the linked duplicate for details. – Karl Knechtel Oct 10 '22 at 06:12
  • @KarlKnechtel there are multiple word boundaries inside `15.2.3.9.7` so that duplicate is not relevant to this question – Nick Oct 10 '22 at 06:14
  • @KarlKnechtel If I use `\b`, the result is `lorum3 ipsum bar foo v more text 46 here and even more text here v7`. Note that the numbers after v7 are removed. – Gedanggoreng Oct 10 '22 at 06:16

4 Answers

4

You can use this regex to match the strings you want to remove:

(?:^|\s)[0-9]+\.[0-9.]*(?=\s|$)

It matches:

  • (?:^|\s) : beginning of string or whitespace
  • [0-9]+ : at least one digit
  • \. : a period
  • [0-9.]* : some number of digits and periods
  • (?=\s|$) : a lookahead to assert end of string or whitespace

Regex demo

You can then replace any matches with the empty string. In Python:

import re
this_string = 'lorum3 ipsum 15.2.3.9.7 bar foo 1. v more text 46 2. here and even more text here v7.8.989 and also 1.2.3c as well'
result = re.sub(r'(?:^|\s)[0-9]+\.[0-9.]*(?=\s|$)', '', this_string)
print(result)

Output:

lorum3 ipsum bar foo v more text 46 here and even more text here v7.8.989 and also 1.2.3c as well
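
One edge case that may be worth checking: when the match is at the very start of the string, the `^` alternative consumes no whitespace, so the space after the removed word stays. Below is a minimal sketch that trims it. The helper name `remove_numbered` and the sample input are assumptions for illustration, and `strip()` also removes any whitespace that was originally at either end.

```python
import re

def remove_numbered(text):
    # Hypothetical helper; the pattern is the one given in this answer
    cleaned = re.sub(r'(?:^|\s)[0-9]+\.[0-9.]*(?=\s|$)', '', text)
    # A match at the start of the string leaves the following space behind
    return cleaned.strip()

print(remove_numbered('1. first 2.3 second 4.'))
```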
Nick
2

If you can make use of a lookbehind, you can match the numbers and replace with an empty string:

(?<!\S)\d+\.[\d.]*(?!\S)

Explanation

  • (?<!\S) Assert a whitespace boundary to the left
  • \d+\.[\d.]* Match 1+ digits, then a dot followed by optional digits or dots
  • (?!\S) Assert a whitespace boundary to the right

Regex demo

If you want to match an optional leading whitespace char:

\s?(?<!\S)\d+\.[\d.]*(?!\S)

Regex demo
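
As a rough sketch of how the second pattern could be used in Python (the variable names are only illustrative, and the sample string is the one from the question):

```python
import re

this_string = ('lorum3 ipsum 15.2.3.9.7 bar foo 1. v more text 46 2. '
               'here and even more text here v7.8.989')

# \s? consumes one leading whitespace character, if present,
# so removing the word does not leave a double space
pattern = re.compile(r'\s?(?<!\S)\d+\.[\d.]*(?!\S)')
result = pattern.sub('', this_string)
print(result)
```

Because the lookarounds only accept whitespace or the string boundary on either side, `v7.8.989` is left alone.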

The fourth bird
1

If you don't want to use regex, you can also do it using simple string operations:

res = ' '.join(e for e in this_string.split() if not (e[0] in '0123456789' and '.' in e))
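
Filtering on "starts with a digit and contains a period" also drops a word such as `1.2.3c`, which the comments on the question say should be kept. A possible variant, assuming that only words made up entirely of digits and periods should go, is sketched here. The helper name `is_numbered` is hypothetical.

```python
this_string = ('lorum3 ipsum 15.2.3.9.7 bar foo 1. v more text 46 2. here '
               'and even more text here v7.8.989 and also 1.2.3c as well')

def is_numbered(token):
    # Hypothetical helper: True for tokens such as '15.2.3.9.7' or '1.'
    # (starts with a digit, has a period, and holds nothing but digits and periods)
    return (token[0] in '0123456789'
            and '.' in token
            and all(c in '0123456789.' for c in token))

# split() never yields empty tokens, so token[0] is safe
res = ' '.join(t for t in this_string.split() if not is_numbered(t))
print(res)
```

Note that `split()` followed by `' '.join(...)` also collapses any run of whitespace into a single space.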
mahesh
1

You can try this regex:

(^|\s)\d[^\s]*\.+[^\s]*

Note that this also matches tokens such as '7.a.0.1', which contain letters in addition to digits and periods.

Here is a demo.
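
A quick way to see what this pattern removes is to run it on a short made-up sample. The string below is an assumption, chosen to include a token with letters after the leading digit.

```python
import re

sample = 'foo 7.a.0.1 bar 1.2.3c baz v7.8.989 46'

# Any whitespace-delimited token that starts with a digit and
# contains at least one period is removed, letters included
result = re.sub(r'(^|\s)\d[^\s]*\.+[^\s]*', '', sample)
print(result)
```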

M..
  • This would remove `1.2.3c`, which is not supposed to be removed (see comments to question) – Nick Oct 10 '22 at 11:29