3

I am often faced with patterns where the interesting part is delimited by a specific character and the rest does not matter. A typical example:

/dev/sda1       472437724  231650856 216764652  52% /

I would like to extract 52 (which can also be 9, or 100 - so 1 to 3 digits) by saying "match anything, then when you get to % (which is unique in that line), see before for the matches to extract".

I tried to code this as .*(\d*)%.* but the group is not matched:

  • .* match anything, any number of times
  • % ... until you get to the literal % (the \d is also matched by .* but my understanding is that once % is matched, the regex engine will work backwards, since it now has an "anchor" on which to analyze what was before -- please tell me if this reasoning is incorrect, thank you)
  • (\d*) ... and now before that % you had a (\d*) to match and group
  • .* ... and the rest does not matter (match everything)
WoJ
  • You match nothing because the digits are optional. Try using a word boundary or match a space before `^.*\b(\d+)%.*` https://regex101.com/r/niKGIX/1 – The fourth bird Aug 02 '19 at 14:26
  • What about \w*% ? – GSazheniuk Aug 02 '19 at 14:27
  • (\d{1,3})% should be enough – Oflocet Aug 02 '19 at 14:28
  • I am not sure it’s relevant in your case, but I had to solve a similar “backwards” problem. What I ended up doing was reversing the string and then writing a regex that operated on the reversed string. Worked very well as the particular data structure was easier to parse right to left. – JL Peyret Aug 04 '19 at 05:37

5 Answers

3

Your regex does not work because .* matches too much, and the group matches too little. Because of the * quantifier, the group \d* can match nothing at all, leaving everything else to be matched by the two .* parts.

And your description of .* is somewhat incorrect. It actually matches everything until the end of the line, and then moves backwards one character at a time until the rest of the pattern after it ((\d*)%.*) matches. For more info, see here.
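The backtracking described above can be checked with a short sketch. It assumes Python's re module (the question does not name a language), and the variable names are illustrative only.

```python
import re

line = "/dev/sda1       472437724  231650856 216764652  52% /"

# The question's pattern: the greedy .* first runs to the end of the line,
# then gives characters back until (\d*)%.* can match what follows.
m = re.match(r".*(\d*)%.*", line)

# Backtracking stops as soon as .* ends right before the %, because
# \d* is satisfied by zero digits there. The group exists but is empty.
captured = m.group(1)
group_start = m.start(1)
percent_index = line.index("%")

print(repr(captured))                 # the captured text
print(group_start == percent_index)   # the empty group sits directly at the %
```

The digits 52 end up inside the first .* rather than inside the group, which is why the group appears not to match.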

In fact, I think your text can be matched simply by:

(\d{1,3})%

And getting group 1.

The logic of "keep looking until you find..." is kind of baked into the regex engine, so you don't need to explicitly say .* unless you want it in the match. In this case you just want the number before the %, right?
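As a minimal sketch of this suggestion, assuming Python's re module: re.search scans the line for the first place where the pattern fits, so no leading .* is needed. The helper name extract_percent and the sample lines other than the one from the question are made up for illustration.

```python
import re

def extract_percent(line):
    """Return the number before the first % as an int, or None if there is none."""
    m = re.search(r"(\d{1,3})%", line)
    return int(m.group(1)) if m else None

samples = [
    "/dev/sda1       472437724  231650856 216764652  52% /",
    "/dev/sdb1       100000000    9000000  91000000   9% /data",    # hypothetical
    "/dev/sdc1       100000000  100000000         0 100% /backup",  # hypothetical
]

for sample in samples:
    print(extract_percent(sample))
```

Returning None when no % is present avoids an AttributeError on lines such as a header row.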

Sweeper
2

If you are looking to extract just the number, then I would use:

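import re
# \d+ (not \d*) so that the empty match at the % itself is not returned
pattern = r"\d+(?=%)"
string = "/dev/sda1   472437724  231650856 216764652  52% /"
# findall returns every run of digits that is immediately followed by %
returnedMatches = re.findall(pattern, string)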

The regex uses a positive lookahead for the % character, so the % must follow the digits but is not itself part of the match.

flokibb
1

In your pattern, the .* part matches until the end of the string. Then it backtracks, giving up as little as possible, until it can match 0+ digits followed by a %.

The % is matched because matching 0+ digits is ok. Then you match again .* till the end of the string. There is a capturing group, only it is empty.

What you might do is add a word boundary or a space before the digits:

.* (\d{1,3})%.* or .*\b(\d{1,3})%.*

Regex demo 1 Or regex demo 2

Note that using .* (greedy) you will get the last instance of the digits and the % sign.

If you make it non-greedy, you match the first occurrence:

.*?(\d{1,3})%.*

Regex demo
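The difference between the greedy and non-greedy variants can be sketched as follows, assuming Python's re module. The sample line with two percentages is made up for illustration; it is not from the question.

```python
import re

# Hypothetical line containing two percentages.
line = "cache 10% used, disk 20% used"

# Greedy .* runs to the end and backtracks, so it settles on the LAST number.
greedy = re.match(r".*\b(\d{1,3})%.*", line).group(1)

# Non-greedy .*? grows one character at a time, so it stops at the FIRST number.
lazy = re.match(r".*?(\d{1,3})%.*", line).group(1)

print(greedy, lazy)
```

For the line in the question there is only one %, so both variants capture the same number.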

The fourth bird
1

By default regex matches as greedily as possible. The initial .* in your regex sequence is matching everything up to the %:

"/dev/sda1       472437724  231650856 216764652  52"

This is acceptable for the regex, because it just chooses to have the next pattern, (\d*), match 0 characters.

In this scenario a couple of options could work for you. I would recommend using the preceding space to define a sequence which "starts with a single space, contains any number of digits in the middle, and ends with a percentage symbol":

' (\d*)%'
Gershom Maes
0

Try this:

.*(\b\d{1,3}(?=\%)).*

demo

Mohammad Ali Amini