2

Using Python 3's regex capabilities, is it possible to capture a variable number of groups, based on the number of repetitions found? For instance, in the following search strings, I want to capture all the digit strings with the same regex.

Search string 1 (trying to capture: 89, 45):

zzz89zzz45.mp3

Search string 2 (trying to capture: 98, 67, 89, 45):

zzz98zzz67zzz89zzz45.mp3

Search string 3 (trying to capture: 98, 67, 89, 45, 55, 111):

zzz98zzz67zzz89zzz45vdvd55lplp111.mp3

The following regex matches all the repetitions, but not all the values are available for later use (only one digit string is captured):

((\d+)\D*)*\.mp3$
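
To make the problem concrete, here is a minimal sketch that runs this regex against search string 2. The variable names are illustrative only; the point is that a group inside a repetition keeps just its final match.

```python
import re

s = "zzz98zzz67zzz89zzz45.mp3"

# The whole run of digit strings is matched, but the inner group is reused
# on every repetition.
m = re.search(r'((\d+)\D*)*\.mp3$', s)

# Each repetition overwrites the previous capture, so only the last
# digit string is still available afterwards.
last_digits = m.group(2)
print(last_digits)  # 45
```

The values 98, 67 and 89 took part in the match but cannot be read back from the match object.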

The other two options are writing a different regex for every case, or using findall(). Is there a way to adjust the above regex so that it captures every digit string for later use, for any number of repetitions, using just regex facilities? Or, to do this in Python 3, are you forced to use findall()?

ryan_m
  • Nothing's wrong with findall(); I'm using it in my code now. I'm just trying to understand regexes better. – ryan_m Jul 13 '11 at 03:45

2 Answers

3

This will match all the numbers before the dot:

import re

s = "zzz98zzz67zzz89zzz45vdvd55lplp111.mp3"
# Match each digit run that is still followed, somewhere later, by a literal dot.
res = re.findall(r"[0-9]+(?=.*\.)", s)
print(res)
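
As a quick check, the same pattern can be run over all three search strings from the question. This is only a sketch; `samples`, `pattern` and `results` are names chosen here for illustration.

```python
import re

samples = [
    "zzz89zzz45.mp3",
    "zzz98zzz67zzz89zzz45.mp3",
    "zzz98zzz67zzz89zzz45vdvd55lplp111.mp3",
]

# (?=.*\.) accepts a digit run only if a literal dot still appears later
# in the string, which is what keeps the "3" of ".mp3" out of the results.
pattern = re.compile(r"[0-9]+(?=.*\.)")

results = [pattern.findall(name) for name in samples]
for name, found in zip(samples, results):
    print(name, found)
```

The lookahead does the job that stripping the extension would otherwise do, so no preprocessing of the filename is needed.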
Petar Ivanov
  • It will, but you are using findall(). I would like to know if you can do it using just a regex, not the (admittedly useful) extra functions Python 3 gives you. – ryan_m Jul 13 '11 at 03:41
  • This is using regex: the parameter to findall is a regex, isn't it? – Petar Ivanov Jul 13 '11 at 03:44
  • In my code, I strip the .mp3, then do a findall('\d+'). While '\d+' is a regex, I'm interested in whether it's possible in Python 3 to do this with a "bare" regex, without using something like findall(). I want to know whether this is the kind of problem a regex can deal with on its own, or whether you need something like findall() in this circumstance. – ryan_m Jul 13 '11 at 03:55
3

Most or all regular expression engines in common use, including in particular those with Perl-style syntax (like Python's `re` module), label their capturing groups according to the numerical index of the opening parenthesis, as the regex is written. The number of groups is therefore fixed by the pattern itself, and a group that matches more than once keeps only its last match. So no, you cannot use capturing groups alone to extract an arbitrary, variable number of subsequences from a string.
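
A small sketch of what that means in practice, using the regex from the question: the group count is a property of the compiled pattern, not of the string being searched. The names `pattern` and `group_counts` are illustrative.

```python
import re

pattern = re.compile(r'((\d+)\D*)*\.mp3$')

# Groups are numbered by their opening parentheses, so the count is known
# before any matching happens.
print(pattern.groups)

# Inputs with two and with six digit strings expose the same number of groups.
group_counts = []
for name in ("zzz89zzz45.mp3", "zzz98zzz67zzz89zzz45vdvd55lplp111.mp3"):
    m = pattern.search(name)
    group_counts.append(len(m.groups()))
print(group_counts)
```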

The closest you can get (as far as I know) is to manually write out a certain number of capturing groups, something like this:

import re

s = ...
# Each group is optional, so groups without a digit run come back as None.
res = re.match(r'\D*' + 25 * r'(?:(\d+)\D+)?', s)
numbers = [r for r in res.groups() if r is not None]

This will get you up to 25 groups of digits. If you need more, replace 25 with some higher number.
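
Here is a runnable sketch of the fixed-number-of-groups idea on the question's sample strings. It assumes each `(\d+)\D+` unit is wrapped in an optional non-capturing group, so that inputs with fewer digit runs than groups still match; `MAX_GROUPS` and `digit_runs` are hypothetical names introduced for the example.

```python
import re

MAX_GROUPS = 25  # arbitrary upper bound, as in the answer

# Assumption: every (\d+)\D+ unit is optional, so unused groups are simply
# left as None instead of making the whole match fail.
pattern = re.compile(r'\D*' + MAX_GROUPS * r'(?:(\d+)\D+)?')

def digit_runs(name):
    """Return the digit strings captured by the fixed set of groups."""
    m = pattern.match(name)  # always matches, since every part is optional
    return [g for g in m.groups() if g is not None]

print(digit_runs("zzz89zzz45.mp3"))
print(digit_runs("zzz98zzz67zzz89zzz45vdvd55lplp111.mp3"))
```

Because every unit requires at least one non-digit after the digits, the trailing "3" of ".mp3" is not captured. Any digit runs beyond the first `MAX_GROUPS` would be silently dropped.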

I wouldn't be surprised if this were less efficient than the iterative approach with findall(), although I haven't tested it.
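
For anyone who wants to test it, a rough timing sketch with the standard `timeit` module could look like the following. It assumes the optional-group form of the 25-group pattern and the lookahead-based `findall()` pattern from the other answer; the function names are illustrative, and no timing figures are claimed because they depend on the machine and Python version.

```python
import re
import timeit

s = "zzz98zzz67zzz89zzz45vdvd55lplp111.mp3"

# Assumption: optional groups, so the 25-group pattern matches short inputs too.
fixed = re.compile(r'\D*' + 25 * r'(?:(\d+)\D+)?')
digits = re.compile(r'[0-9]+(?=.*\.)')

def with_groups():
    return [g for g in fixed.match(s).groups() if g is not None]

def with_findall():
    return digits.findall(s)

# Print the total time for 10,000 calls of each approach.
for fn in (with_groups, with_findall):
    print(fn.__name__, timeit.timeit(fn, number=10_000))
```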

David Z
  • Thanks, that's just what I was looking for. I figured, Python being Python and all, findall() wouldn't be available if we didn't need it, but I just wanted to make sure. – ryan_m Jul 13 '11 at 04:36
  • "findall() wouldn't be available if we didn't need it"... huh? In any case, for what it's worth, if I were doing this myself I would almost certainly use `findall()`. – David Z Jul 13 '11 at 17:22
  • Oh, I see. In this case, `findall()` is the one obvious way to do it. This sort of thing is exactly what the function is there for. (Although the designers likely had much longer strings in mind, for which it would be absurdly inefficient to use capturing groups.) Don't take that mantra _too_ literally, though. It's the nature of programming that most tasks can be accomplished in multiple ways, some good and some not. – David Z Jul 13 '11 at 22:46