Python - How to Remove Words That Started With Number and Contain Period

Question

What is the best way to remove words in a string that start with numbers and contain periods in Python?

this_string = 'lorum3 ipsum 15.2.3.9.7 bar foo 1. v more text 46 2. here and even more text here v7.8.989'

If I use Regex:

re.sub(r'[0-9]*\.\w*', '', this_string)

The result will be:

'lorum3 ipsum  bar foo  v more text 46  here and even more text here v'

I expect the word v7.8.989 not to be removed, since it starts with a letter.

It would also be great if removing the words didn't leave extra spaces behind. My regex above still leaves them.
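To see why the attempt above also eats into v7.8.989, it may help to list what the pattern actually matches. This is only a small illustration on a shortened sample string (the `sample` and `matches` names are made up for this sketch): `[0-9]*` can match zero digits, and nothing ties the match to the start of a word.

```python
import re

# Shortened sample; not the full string from the question.
sample = 'v7.8.989 46 1.'

# [0-9]* may match zero digits, and the pattern is not anchored to a word start,
# so it matches fragments inside "v7.8.989" as well as the standalone "1."
matches = re.findall(r'[0-9]*\.\w*', sample)
print(matches)  # ['7.8', '.989', '1.']
```

The plain number 46 is left alone only because it contains no period.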

Isn't `\s` used for anything that matches a single whitespace character? — Gedanggoreng, Oct 10 '22 at 06:00
If I use `\s`, it will only remove the first 2 parts of the number. But if I use `s?`, it will remove all the words that contain numbers and periods. — Gedanggoreng, Oct 10 '22 at 06:05
`s?` means zero or one `s` character, so it's effectively doing nothing — Nick, Oct 10 '22 at 06:05
`1.2.3c` shouldn't be removed either. Only the words made up of numbers and periods should be removed. Everything else should stay in the string. — Gedanggoreng, Oct 10 '22 at 06:12
Use a *word boundary*, `\b`, in the regex, to indicate "at this point in the matching, we must be at either the beginning or end of a word". By putting `\b` before and after some chunk of the regex, we can match a word that matches that chunk. See the linked duplicate for details. — Karl Knechtel, Oct 10 '22 at 06:12
@KarlKnechtel there are multiple word boundaries inside `15.2.3.9.7` so that duplicate is not relevant to this question — Nick, Oct 10 '22 at 06:14
@KarlKnechtel If I use `\b`, the result will be -> `lorum3 ipsum bar foo v more text 46 here and even more text here v7` Note that the numbers after v7 are removed. — Gedanggoreng, Oct 10 '22 at 06:16

Nick · Accepted Answer · 2022-10-11T02:51:21.313

4

You can use this regex to match the strings you want to remove:

(?:^|\s)[0-9]+\.[0-9.]*(?=\s|$)

It matches:

(?:^|\s) : beginning of string or whitespace
[0-9]+ : at least one digit
\. : a period
[0-9.]* : some number of digits and periods
(?=\s|$) : a lookahead to assert end of string or whitespace

You can then replace any matches with the empty string. In Python:

import re

this_string = 'lorum3 ipsum 15.2.3.9.7 bar foo 1. v more text 46 2. here and even more text here v7.8.989 and also 1.2.3c as well'
result = re.sub(r'(?:^|\s)[0-9]+\.[0-9.]*(?=\s|$)', '', this_string)

Output:

lorum3 ipsum bar foo v more text 46 here and even more text here v7.8.989 and also 1.2.3c as well
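One edge case worth noting: when the removed word is the very first word of the string, the pattern matches at `^` without a leading space, so the space after the word is left in place. A minimal sketch of one way to handle that, assuming it is acceptable to strip leading whitespace from the result (`PATTERN` and `remove_numeric_words` are names invented for this example):

```python
import re

# Same pattern as in the answer above, compiled once for reuse.
PATTERN = re.compile(r'(?:^|\s)[0-9]+\.[0-9.]*(?=\s|$)')

def remove_numeric_words(text):
    """Remove words made only of digits and periods that start with a digit."""
    # lstrip() drops the space left behind when the first word is removed.
    # Assumption: any leading whitespace already in the input may be dropped too.
    return PATTERN.sub('', text).lstrip()

cleaned = remove_numeric_words('1.5 foo 2. bar v7.8')
print(cleaned)  # foo bar v7.8
```

Without the `lstrip()` call, the same input would come back as ' foo bar v7.8' with a leading space.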

edited Oct 11 '22 at 02:51

answered Oct 10 '22 at 06:21

Thank you. It works like a charm. I've been trying to figure this out for a couple of days. – Gedanggoreng Oct 10 '22 at 06:29
`(..., '', 0, re.I)` Can you explain what is the need for using 0 and re.I parameter? – Gedanggoreng Oct 10 '22 at 06:31
@Gedanggoreng sorry - they're not required. When I thought you wanted to match `1.2.3c` I put them in so the regex would be case-insensitive. I've edited them out – Nick Oct 10 '22 at 06:33
Ah, okay. Thank you once again. – Gedanggoreng Oct 10 '22 at 06:34

The fourth bird · Answer 2 · 2022-10-10T09:38:23.337

2

If you can make use of a lookbehind, you can match the numbers and replace with an empty string:

(?<!\S)\d+\.[\d.]*(?!\S)

Explanation

(?<!\S) Assert a whitespace boundary to the left
\d+\.[\d.]* Match 1+ digits, then a dot followed by optional digits or dots
(?!\S) Assert a whitespace boundary to the right

If you want to match an optional leading whitespace char:

\s?(?<!\S)\d+\.[\d.]*(?!\S)
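For reference, here is how the two patterns above might be applied in Python. This is just a sketch on a shortened sample string (`text`, `kept_spaces` and `trimmed` are names chosen for this example); it shows the difference the optional `\s?` makes to the leftover spaces.

```python
import re

# Shortened sample; not the full string from the question.
text = 'ipsum 15.2.3.9.7 bar 1.2.3c v7.8.989 2.'

# Lookarounds only: the spaces around each removed word stay in place.
kept_spaces = re.sub(r'(?<!\S)\d+\.[\d.]*(?!\S)', '', text)
print(repr(kept_spaces))  # 'ipsum  bar 1.2.3c v7.8.989 '

# With the optional leading whitespace character consumed as well.
trimmed = re.sub(r'\s?(?<!\S)\d+\.[\d.]*(?!\S)', '', text)
print(repr(trimmed))  # 'ipsum bar 1.2.3c v7.8.989'
```

Both `1.2.3c` and `v7.8.989` survive in either variant, because the lookarounds require whitespace (or the string boundary) on both sides of the match.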

edited Oct 10 '22 at 09:38

answered Oct 10 '22 at 09:33

Well given that two incorrect answers have been upvoted the least I can do is bring yours in line... – Nick Oct 10 '22 at 11:32
@Gedanggoreng this is a better answer, you should accept it instead. – Nick Oct 10 '22 at 11:36
@Nick It's all good, I am on a holiday..woke up, quickly posted an answer and went to the beach. Now that I have had some time to read all the answers, your answer looks fine to me. – The fourth bird Oct 10 '22 at 11:39
Sure, it's fine, but not as optimal as this. – Nick Oct 10 '22 at 11:40
@Nick Yeah, I really start appreciating a keyboard doing this on a mobile phone – The fourth bird Oct 10 '22 at 11:42
That just makes it even more impressive! :) enjoy your break. – Nick Oct 10 '22 at 11:43
Thank you, it's working great. @Nick Are you sure it's better to accept this answer? – Gedanggoreng Oct 11 '22 at 01:46
@Gedanggoreng up to you. I improved my answer (removing a lookahead that was left over from the letters time) so now they're pretty similar. – Nick Oct 11 '22 at 02:51
I think it's better to stay with @Nick's answer. I believe The fourth bird will not object to this. – Gedanggoreng Oct 11 '22 at 13:32
@Gedanggoreng I don't object at all :-) – The fourth bird Oct 11 '22 at 13:58

score 1 · Answer 3 · answered Oct 10 '22 at 06:40

1

If you don't want to use regex, you can also do it using simple string operations:

res = ''.join(['' if (e.startswith(('0','1','2','3','4','5','6','7','8','9')) and '.' in e) else e+' ' for e in this_string.split()])

answered Oct 10 '22 at 06:40

mahesh

This removes `1.2.3c`, which is not supposed to be removed (see comments to question) – Nick Oct 10 '22 at 08:06
@Nick You are right. I think then we can simply extend this principle to exclude such cases too. – mahesh Oct 12 '22 at 01:35
Go on then..... – Nick Oct 12 '22 at 02:11
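The extension was not posted in the thread. One possible way to do it without regex, offered only as a sketch, is to also require that every character of the word is a digit or a period. The names `ALLOWED`, `is_numeric_word` and `remove_numeric_words` are invented for this example, and it assumes that collapsing runs of whitespace into single spaces is acceptable, since `split()` and `' '.join()` do that.

```python
ALLOWED = set('0123456789.')

def is_numeric_word(word):
    """True for words that start with a digit, contain a period,
    and consist only of digits and periods (so '1.2.3c' is False)."""
    return word[0] in '0123456789' and '.' in word and set(word) <= ALLOWED

def remove_numeric_words(text):
    # split() never yields empty strings, so word[0] is always safe.
    return ' '.join(w for w in text.split() if not is_numeric_word(w))

this_string = ('lorum3 ipsum 15.2.3.9.7 bar foo 1. v more text 46 2. here and '
               'even more text here v7.8.989 and also 1.2.3c as well')
res = remove_numeric_words(this_string)
print(res)
# lorum3 ipsum bar foo v more text 46 here and even more text here v7.8.989 and also 1.2.3c as well
```

Joining with `' '.join` instead of appending a space to each kept word also avoids the trailing space that the one-liner above produces.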

M.. · Answer 4 · 2022-10-10T08:37:26.560

1

You can try this regex:

(^|\s)\d[^\s]*\.+[^\s]*

This also matches strings such as '7.a.0.1', which contain letters as well.
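A quick check of that behaviour on a made-up sample (`sample` and `result` are names used only in this sketch) shows that the pattern removes any whitespace-delimited word that starts with a digit and contains a period, letters included:

```python
import re

# Made-up sample; not the string from the question.
sample = 'keep 7.a.0.1 and 1.2.3c v7.8'

# [^\s]* accepts any non-whitespace character, so letters are matched too.
result = re.sub(r'(^|\s)\d[^\s]*\.+[^\s]*', '', sample)
print(result)  # keep and v7.8
```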

edited Oct 10 '22 at 08:37

answered Oct 10 '22 at 08:24

This would remove `1.2.3c`, which is not supposed to be removed (see comments to question) – Nick Oct 10 '22 at 11:29
