
I have the following regex: '(/[a-zA-Z]+)*/([a-zA-Z]+)\.?$'.

Given the string "/foo/bar/baz", I expect the first captured group to be "/foo/bar". However, I get the following:

>>> import re
>>> regex = re.compile(r'(/[a-zA-Z]+)*/([a-zA-Z]+)\.?$')
>>> match = regex.match('/foo/bar/baz')
>>> match.group(1)
'/bar'

Why isn't the whole expected group being captured?

Edit: It's worth mentioning that the strings I'm trying to match are parts of URLs. To give you an idea, it's the part of the URL that window.location.pathname would return in JavaScript, only without file extensions.

Anthony Bias

3 Answers

This part of the pattern matches the group repeatedly:

(/[a-zA-Z]+)*

However, as already discussed in another thread, to quote @ByteCommander:

If your capture group gets repeated by the pattern (you used the + quantifier on the surrounding non-capturing group), only the last value that matches it gets stored.

That is why you only see the last match, "/bar". Instead, you can take advantage of the greedy matching of .* up to the last / with the pattern (/.*)/

regex = re.compile(r'(/.*)/([a-zA-Z]+)\.?$')
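If you would rather keep the original character restriction, one possible variation (a sketch, not something proposed in the thread quoted above) is to make the repeated group non-capturing and wrap the whole repetition in a single outer capturing group. The names `pattern`, `prefix`, and `last` are only illustrative:

```python
import re

# The inner group is non-capturing, so it is not overwritten on each
# repetition; the outer group captures the entire repeated run once.
pattern = re.compile(r'((?:/[a-zA-Z]+)*)/([a-zA-Z]+)\.?$')

match = pattern.match('/foo/bar/baz')
prefix = match.group(1)
last = match.group(2)

print(prefix)  # /foo/bar
print(last)    # baz
```

Because the character class is unchanged, a path such as '/_&@!@#%/baz' should still be rejected by this variant.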
  • This mostly answers my question. Wouldn't `(/.*)/` match any characters? For example, `'/_&@!@#%/'` would match. What do I do if I want to use greedy matching but only match certain characters? – Anthony Bias Sep 16 '21 at 11:51
  • Greedy matching works by matching up to the last possible character **BUT** still satisfying the whole clause. So if the text is `abcde`, and the pattern is `(.*)e`, the `(.*)` wouldn't capture everything, only `abcd`. If you want specific characters, just change it to something like `[/\w\d]*/` where that slash after the asterisk would be the last possible slash that would still satisfy the whole pattern. – Niel Godfrey Pablo Ponciano Sep 16 '21 at 12:00
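The two points from the comments above can be checked with a short sketch. It assumes you want to allow only letters and slashes, so it uses `[/a-zA-Z]` rather than the `[/\w\d]` from the comment (note that `\w` also accepts digits and underscores). The variable names are illustrative:

```python
import re

# Greedy .* first grabs everything, then backs off just far enough
# for the rest of the pattern (the literal "e") to match.
greedy = re.match(r'(.*)e', 'abcde').group(1)

# Restricting the class to letters and slashes keeps the greedy behaviour
# but rejects paths that contain any other character.
restricted = re.compile(r'([/a-zA-Z]*)/([a-zA-Z]+)\.?$')
accepted = restricted.match('/foo/bar/baz').group(1)
rejected = restricted.match('/_&@!@#%/baz')

print(greedy)    # abcd
print(accepted)  # /foo/bar
print(rejected)  # None
```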

You don't need the * between the two expressions here; also move the first / into the character class:

>>> regex = re.compile(r'([/a-zA-Z]+)/([a-zA-Z]+)\.?$')
>>> regex.match('/foo/bar/baz').group(1)
'/foo/bar'
U13-Forward

In this case, you may not need a regex at all. You can simply use the split function.

text = "/foo/bar/baz"
"/".join(text.split("/", 3)[:3])

Output:

/foo/bar

text.split("/", 3) splits your string at the first three occurrences of /, and then you can join the desired elements.

As suggested by Niel, you can use a negative index to extract everything but the last part of a URL (or a path).

In this case, the generic approach would be:

text = "/foo/bar/baz/boo/bye"
"/".join(text.split("/", -1)[:-1])

Output:

/foo/bar/baz/boo
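As a possible alternative to split plus join (a sketch that is not part of the suggestion above), `str.rsplit` with a maximum of one split separates the last component directly, and the standard library's `posixpath.dirname` returns the same prefix. The names `parent`, `last`, and `via_dirname` are illustrative:

```python
import posixpath

text = "/foo/bar/baz/boo/bye"

# Split once, starting from the right: everything before the last "/"
# and the final component.
parent, last = text.rsplit("/", 1)

# posixpath.dirname always uses "/" as the separator, regardless of the OS.
via_dirname = posixpath.dirname(text)

print(parent)       # /foo/bar/baz/boo
print(last)         # bye
print(via_dirname)  # /foo/bar/baz/boo
```

Note that neither approach checks which characters the path contains, so use the regex if you also need that validation.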
BlackMath