Regex group doesn't capture all of matched part of string

Question

I have the following regex: '(/[a-zA-Z]+)*/([a-zA-Z]+)\.?$'.

Given the string "/foo/bar/baz", I expect the first captured group to be "/foo/bar". However, I get the following:

>>> import re
>>> regex = re.compile(r'(/[a-zA-Z]+)*/([a-zA-Z]+)\.?$')
>>> match = regex.match('/foo/bar/baz')
>>> match.group(1)
'/bar'

Why isn't the whole expected group being captured?

Edit: It's worth mentioning that the strings I'm trying to match are parts of URLs. To give you an idea, it's the part of the URL that window.location.pathname would return in JavaScript, only without file extensions.

What are you trying to extract here and what is the logic? – Tim Biegeleisen Sep 16 '21 at 11:27

score 3 · Accepted Answer · answered Sep 16 '21 at 11:32

This group is repeated, so it can match several path segments in a row:

(/[a-zA-Z]+)*

However, as already discussed in another thread, quoting @ByteCommander:

If your capture group gets repeated by the pattern (you used the + quantifier on the surrounding non-capturing group), only the last value that matches it gets stored.

That is why you only see the last repetition, "/bar". What you can do instead is take advantage of greedy matching: .* consumes as much as possible, so the pattern (/.*)/ captures everything up to the last / that still lets the rest of the pattern match:

regex = re.compile(r'(/.*)/([a-zA-Z]+)\.?$')
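
As a quick check of the explanation above, here is a minimal sketch that runs the original pattern and the greedy variant side by side on the asker's input. The variable names (`repeated`, `greedy`, `last_only`, `greedy_first`) are only illustrative.

```python
import re

# Original pattern: the first group sits under a * quantifier,
# so each repetition overwrites what the group captured before.
repeated = re.compile(r'(/[a-zA-Z]+)*/([a-zA-Z]+)\.?$')
last_only = repeated.match('/foo/bar/baz').group(1)
print(last_only)      # '/bar' (only the last repetition is kept)

# Greedy variant: the group itself is not repeated; .* does the repeating.
greedy = re.compile(r'(/.*)/([a-zA-Z]+)\.?$')
m = greedy.match('/foo/bar/baz')
greedy_first = m.group(1)
print(greedy_first)   # '/foo/bar'
print(m.group(2))     # 'baz'
```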

answered Sep 16 '21 at 11:32

Niel Godfrey Pablo Ponciano

This mostly answers my question. Wouldn't `(/.*)/` match any characters? For example, `'/_&@!@#%/'` would match. What do I do if I want to use greedy matching but only match certain characters? – Anthony Bias Sep 16 '21 at 11:51
Greedy matching works by matching up to the last possible character **BUT** still satisfying the whole clause. So if the text is `abcde`, and the pattern is `(.*)e`, the `(.*)` wouldn't capture everything, only `abcd`. If you want specific characters, just change it to something like `[/\w\d]*/` where that slash after the asterisk would be the last possible slash that would still satisfy the whole pattern. – Niel Godfrey Pablo Ponciano Sep 16 '21 at 12:00
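
To make the two comments above concrete, here is a small sketch. The first part reproduces the `abcde` example. The second part is one possible way to keep the character restriction while still capturing every segment: wrap the repeated part in a non-capturing group and put a single capturing group around the whole repetition. This variant is an assumption of this sketch, not something proposed in the answer, and the name `strict` is hypothetical.

```python
import re

# Greedy .* backs off just far enough for the trailing 'e' to match.
abcd = re.match(r'(.*)e', 'abcde').group(1)
print(abcd)  # 'abcd'

# The permissive pattern from the answer accepts any characters in the prefix.
greedy = re.compile(r'(/.*)/([a-zA-Z]+)\.?$')
odd_prefix = greedy.match('/_&@!@#%/baz').group(1)
print(odd_prefix)  # '/_&@!@#%'

# Assumed stricter variant: (?:...) repeats without capturing, and the
# outer group captures the whole run of letter-only segments at once.
strict = re.compile(r'((?:/[a-zA-Z]+)*)/([a-zA-Z]+)\.?$')
strict_first = strict.match('/foo/bar/baz').group(1)
print(strict_first)  # '/foo/bar'
rejected = strict.match('/_&@!@#%/baz')
print(rejected)  # None
```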

score 0 · Answer 2 · answered Sep 16 '21 at 11:27

You don't need the * between the two expressions here. Instead, move the first / into the brackets:

>>> regex = re.compile(r'([/a-zA-Z]+)/([a-zA-Z]+)\.?$')
>>> regex.match('/foo/bar/baz').group(1)
'/foo/bar'
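
One thing worth checking with this pattern: because / is now inside the character class, the first group no longer enforces the "slash followed by letters" shape of each segment. The following sketch shows what that means in practice; the test strings and the name `loose` are made up for illustration.

```python
import re

loose = re.compile(r'([/a-zA-Z]+)/([a-zA-Z]+)\.?$')

# The asker's input still gives the expected result.
normal = loose.match('/foo/bar/baz').group(1)
print(normal)       # '/foo/bar'

# Runs of consecutive slashes are accepted as well.
doubled = loose.match('//foo///bar/baz').group(1)
print(doubled)      # '//foo///bar'

# A missing leading slash is accepted too.
no_leading = loose.match('foo/bar').group(1)
print(no_leading)   # 'foo'
```

Whether that looseness matters depends on how well-formed the incoming paths already are.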

answered Sep 16 '21 at 11:27

U13-Forward

BlackMath · Answer 3 · 2021-09-16T11:45:30.520

0

In this case, you may not need a regex at all. You can simply use the split function.

text = "/foo/bar/baz"
"/".join(text.split("/", 3)[:3])

output:

/foo/bar

text.split("/", 3) splits the string at most three times, at the first three occurrences of /, and then you can join the desired elements.
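
To see why this works, it helps to look at the intermediate list. The sketch below prints it for the original input and for a longer path (a made-up example) where the fixed index no longer selects everything but the last part.

```python
text = "/foo/bar/baz"
parts = text.split("/", 3)
print(parts)                # ['', 'foo', 'bar', 'baz']
joined = "/".join(parts[:3])
print(joined)               # '/foo/bar'

# With one more segment, maxsplit=3 leaves the tail unsplit,
# and the slice [:3] drops more than just the last part.
longer = "/foo/doo/bar/baz"
longer_parts = longer.split("/", 3)
print(longer_parts)         # ['', 'foo', 'doo', 'bar/baz']
longer_joined = "/".join(longer_parts[:3])
print(longer_joined)        # '/foo/doo'
```

The leading empty string comes from the slash at the start of the path, which is what puts the leading / back when the parts are joined.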

As suggested by Niel, you can use a negative index to extract everything but the last part of a URL (or a path).

In this case the generic approach would be:

text = "/foo/bar/baz/boo/bye"
"/".join(text.split("/", -1)[:-1])

Output:

/foo/bar/baz/boo
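
As an alternative sketch of the same idea, the built-in string methods `rsplit` and `rpartition` split from the right, so only the last / is touched. This is not part of the original answer, and the variable names are illustrative.

```python
text = "/foo/bar/baz/boo/bye"

# Split once, starting from the right, and keep the left-hand side.
parent = text.rsplit("/", 1)[0]
print(parent)            # '/foo/bar/baz/boo'

# rpartition returns (before, separator, after) around the last "/".
before, sep, last = text.rpartition("/")
print(before, last)      # '/foo/bar/baz/boo' and 'bye'

# Same result as the split/join approach above.
via_join = "/".join(text.split("/")[:-1])
print(parent == via_join)  # True
```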

edited Sep 16 '21 at 11:45

answered Sep 16 '21 at 11:30

BlackMath

You might want to make it more generic: if the URL changes to `"/foo/doo/bar/baz"`, which is very likely for URLs, this will fail because of the static index `3`. Using `-1` would probably work too. – Niel Godfrey Pablo Ponciano Sep 16 '21 at 11:38
Thanks Niel, I edited my answer for a generic approach. – BlackMath Sep 16 '21 at 11:46
