Python Regular Expression; replacing a portion of match

Question

How would I limit the match/replacement to the leading zeros in e004_n07? If a term consists entirely of zeros, I need to retain one zero in that term (see the examples below). For the input string, there will always be 3 digits in the first value and 2 digits in the second value.

Example input and output

e004_n07 #e4_n7
e020_n50 #e20_n50
e000_n00 #e0_n0

Can this be accomplished with re.sub alone, or do I need to use re.search/re.match?

Do you want to replace all zeros or just leading zeros? How do you want to handle e.g. `e020_n50`? — user4815162342, Aug 29 '16 at 19:24
What are the rules? Replace zeros after letters? Use `re.sub(r'([a-zA-Z])0+', r'\1', s)` — Wiktor Stribiżew, Aug 29 '16 at 19:29
@SSH Please edit the question to indicate that you only care about removing the leading zeros. — user4815162342, Aug 29 '16 at 19:38

Wiktor Stribiżew · Accepted Answer · 2016-08-31T06:59:16.123

5

If you want to only remove zeros after letters, you may use:

([a-zA-Z])0+

Replace with \1 backreference. See the regex demo.

The ([a-zA-Z]) will capture a letter and 0+ will match 1 or more zeros.

Python demo:

import re
s = 'e004_n07'
res = re.sub(r'([a-zA-Z])0+', r'\1', s)
print(res)

Note that re.sub finds and replaces all non-overlapping matches (it performs a global search and replace). If there is no match, the string is returned as is, without modifications. So there is no need for an additional re.match/re.search.
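To make the behavior described above concrete, here is a minimal sketch. The input 'e412_n73' is a made-up value chosen because it contains no leading zeros, and the count=1 call is included only for contrast:

```python
import re

pattern = r'([a-zA-Z])0+'

# Both terms are rewritten in a single call (global search and replace).
both = re.sub(pattern, r'\1', 'e004_n07')

# No letter is followed by a zero here, so the input comes back unchanged.
untouched = re.sub(pattern, r'\1', 'e412_n73')

# Passing count=1 restricts the substitution to the first match only.
first_only = re.sub(pattern, r'\1', 'e004_n07', count=1)

print(both)        # e4_n7
print(untouched)   # e412_n73
print(first_only)  # e4_n07
```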

UPDATE

To keep 1 zero if the numbers only contain zeros, you may use

import re
s = ['e004_n07','e000_n00']
res = [re.sub(r'(?<=[a-zA-Z])0+(\d*)', lambda m: m.group(1) if m.group(1) else '0', x) for x in s]
print(res)

See the Python demo

Here, r'(?<=[a-zA-Z])0+(\d*)' regex matches one or more zeros (0+) that are after an ASCII letter ((?<=[a-zA-Z])) and then any other digits (0 or more) are captured into Group 1 with (\d*). Then, in the replacement, we check if Group 1 is empty, and if it is empty, we insert 0 (there are only zeros), else, we insert Group 1 contents (the remaining digits after the first leading zeros).
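The same logic can be written with a named replacement function, which may be easier to read than the lambda. This is only a sketch: keep_one_zero and strip_leading_zeros are hypothetical names, and the inputs are the three examples from the question.

```python
import re

# One or more zeros right after an ASCII letter, then the remaining digits.
LEADING_ZEROS = re.compile(r'(?<=[a-zA-Z])0+(\d*)')

def keep_one_zero(match):
    """Return the digits after the leading zeros, or '0' if there are none."""
    rest = match.group(1)
    return rest if rest else '0'

def strip_leading_zeros(s):
    return LEADING_ZEROS.sub(keep_one_zero, s)

for s in ('e004_n07', 'e020_n50', 'e000_n00'):
    print(s, '->', strip_leading_zeros(s))
# e004_n07 -> e4_n7
# e020_n50 -> e20_n50
# e000_n00 -> e0_n0
```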

edited Aug 31 '16 at 06:59

answered Aug 29 '16 at 19:33

Wiktor Stribiżew


Thanks! One follow-up: I also need to handle the special case, 'e000_n00', where both terms contain all zeros. In this case, I would like the output to be: 'e0_n0'. So the rule would be modified as follows: Remove leading zeros in both terms EXCEPT for the case where either term contains all zeros. In this case, do not remove the final zero. I have found that the following expression seems to work, but would like to know if it is an appropriate solution: res = re.sub(r'([a-zA-Z])0{2}|([_][a-zA-Z])0{1}', r'\1\2', s) – SSH Aug 31 '16 at 04:42
It is interesting: are you using Python 3.5? Your `([a-zA-Z])0{2}|([_][a-zA-Z])0{1}` will not work with `e010_n01` and it won't work with Python versions before 3.5. – Wiktor Stribiżew Aug 31 '16 at 07:13
In Python 3.5, you can also use `re.sub(r'([a-zA-Z])(?=((0)+))\2(?!\d)|([a-zA-Z])0+(\d*)', r'\1\3\4\5', s)`, but I would not recommend that monstrous pattern. In earlier Python versions you will get an *unmatched group* error. – Wiktor Stribiżew Aug 31 '16 at 07:19
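For readers who want to avoid both the callback and the conditional pattern: a simpler alternative, not proposed in the comments above, is to delete the zeros only when a digit still follows them. A minimal sketch, assuming each term starts with an ASCII letter as in the question (strip_leading_zeros is a hypothetical helper name):

```python
import re

# Zeros directly after a letter, but only as many as leave one digit behind.
# The lookahead (?=\d) forces 0+ to stop before the last digit of the term.
LEADING_ZEROS = re.compile(r'(?<=[a-zA-Z])0+(?=\d)')

def strip_leading_zeros(s):
    return LEADING_ZEROS.sub('', s)

for s in ('e004_n07', 'e020_n50', 'e000_n00', 'e010_n01'):
    print(s, '->', strip_leading_zeros(s))
# e004_n07 -> e4_n7
# e020_n50 -> e20_n50
# e000_n00 -> e0_n0
# e010_n01 -> e10_n1
```

Because the replacement is a plain empty string, no group references are involved, so the unmatched-group issue mentioned above does not arise.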

score 1 · Answer 2 · edited Sep 27 '17 at 04:20


There's no need to use re.sub if your replacement is so simple - simply use str.replace:

s = 'e004_n07'
s.replace('0', '') # => 'e4_n7'

edited Sep 27 '17 at 04:20

Graham


answered Aug 29 '16 at 19:21

Rushy Panchal


score 0 · Answer 3 · answered Aug 29 '16 at 19:33

If your requirement is that you MUST use regex, then below is your regex pattern:

>>> import re
>>> s = 'e004_n07'
>>> line = re.sub(r"0", "", s)
>>> line
'e4_n7'

However, it is recommended not to use regex when there is another efficient way to perform the same operation, i.e. using the replace function:

>>> line = s.replace('0', '')
>>> line
'e4_n7'
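Removing every '0', whether with re.sub or str.replace, gives the expected result for e004_n07 but not for the other two inputs in the question, because non-leading zeros are stripped as well. A quick check using the question's examples:

```python
import re

samples = ['e004_n07', 'e020_n50', 'e000_n00']

# Drop every zero, leading or not, with both approaches.
with_replace = [s.replace('0', '') for s in samples]
with_regex = [re.sub(r'0', '', s) for s in samples]

print(with_replace)  # ['e4_n7', 'e2_n5', 'e_n']
print(with_regex)    # ['e4_n7', 'e2_n5', 'e_n']
```

The question asks for e20_n50 and e0_n0 for the last two inputs, so this approach only fits data without interior or all-zero terms.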

score 0 · Answer 4 · 2016-08-29T19:58:08.900

edit: Don't let anybody talk you out of validating the format of the fixed data. If that's what you need, don't settle for something overly simple.

Not very pretty, but in a situation that seems fixed, you can just
enumerate all the permutations, blindly capture the good parts,
leave out the zeros, then substitute it all back.

Find ([a-z])(?:([1-9][0-9][0-9])|0([1-9][0-9])|00([1-9]))(_[a-z])(?:([1-9][0-9])|0([1-9]))

Replace $1$2$3$4$5$6$7

Expanded

 ( [a-z] )                     # (1)
 (?:
      ( [1-9] [0-9] [0-9] )         # (2)
   |  
      0
      ( [1-9] [0-9] )               # (3)
   |  
      00
      ( [1-9] )                     # (4)
 )
 ( _ [a-z] )                   # (5)
 (?:
      ( [1-9] [0-9] )               # (6)
   |  
      0
      ( [1-9] )                     # (7)
 )

Output

 **  Grp 0 -  ( pos 0 , len 8 ) 
e004_n07  
 **  Grp 1 -  ( pos 0 , len 1 ) 
e  
 **  Grp 2 -  NULL 
 **  Grp 3 -  NULL 
 **  Grp 4 -  ( pos 3 , len 1 ) 
4  
 **  Grp 5 -  ( pos 4 , len 2 ) 
_n  
 **  Grp 6 -  NULL 
 **  Grp 7 -  ( pos 7 , len 1 ) 
7
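
The Find/Replace pair above uses $1-style references, which Python's re.sub does not interpret as group references. A minimal sketch of how the same pattern could be applied in Python, assuming a replacement function that joins whichever groups took part in the match (strip_validated is a hypothetical name):

```python
import re

# Same pattern as in the answer: each alternative captures only the
# digits that remain once the leading zeros are skipped.
VALIDATING = re.compile(
    r'([a-z])(?:([1-9][0-9][0-9])|0([1-9][0-9])|00([1-9]))'
    r'(_[a-z])(?:([1-9][0-9])|0([1-9]))'
)

def strip_validated(s):
    # Groups that did not participate are None, so skip them when joining.
    return VALIDATING.sub(
        lambda m: ''.join(g for g in m.groups() if g is not None), s
    )

print(strip_validated('e004_n07'))  # e4_n7
print(strip_validated('e020_n50'))  # e20_n50
print(strip_validated('e000_n00'))  # e000_n00
```

Neither alternation has a branch for a term made up only of zeros, so e000_n00 does not match and is returned unchanged.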

Let's see, it validates the entire paired value and does it in one pass. Let's check the performance:

Regex1: ([a-zA-Z])0+
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 10
Elapsed Time: 0.40 s, 404.17 ms, 404168 µs

Regex2: ([a-z])(?:([1-9][0-9][0-9])|0([1-9][0-9])|00([1-9]))(_[a-z])(?:([1-9][0-9])|0([1-9]))
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 5
Elapsed Time: 0.47 s, 465.32 ms, 465317 µs

– , Aug 29 '16 at 19:54
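
A comparison along these lines can be reproduced in Python with timeit. This is a rough sketch only: the benchmark text used in the comment is not given, so SAMPLE is an assumed input, built so that the match counts come out as 10 and 5. Timings will differ by machine and regex engine.

```python
import re
import timeit

# Hypothetical sample: five pairs, every term with at least one leading zero.
SAMPLE = 'e004_n07 e020_n05 e001_n01 e010_n09 e099_n03'

SIMPLE = re.compile(r'([a-zA-Z])0+')
STRICT = re.compile(
    r'([a-z])(?:([1-9][0-9][0-9])|0([1-9][0-9])|00([1-9]))'
    r'(_[a-z])(?:([1-9][0-9])|0([1-9]))'
)

# Number of matches each pattern finds in one pass over the sample.
simple_count = len(SIMPLE.findall(SAMPLE))
strict_count = len(STRICT.findall(SAMPLE))

# Total time for 10,000 passes with each pattern.
t_simple = timeit.timeit(lambda: SIMPLE.findall(SAMPLE), number=10000)
t_strict = timeit.timeit(lambda: STRICT.findall(SAMPLE), number=10000)

print(f'Regex1: {simple_count} matches, {t_simple:.4f} s')
print(f'Regex2: {strict_count} matches, {t_strict:.4f} s')
```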
