Reuse the same prefix to find the next match if any

Question

I have strings like this:

string = '
something .... something else ...
url="/transfer/packages/00000000-0000-0000-0000-000000000000/connectors/68f74d66-ca3d-4272-9b59-4f737946b3f7/something/138bb190-3b12-4855-88e2-0d1cdf46aeb5/...../...../...../...../...."
other things ...
'

without any CR/LF, it is all on one line.

I want to create a regex which:

if and only if the url starts with /transfer/packages/
captures each subsequent GUID
until the end of the quoted string "
the number of GUIDs to find is unknown, but there is at least one

So far I wrote:

\/transfer\/packages\/[^"]*([A-Za-z0-9]{8}-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}-[A-Za-z0-9]{12})"

but it only captures the LAST GUID. I somehow need to reuse the prefix /transfer/packages/ and keep matching, expanding the search each time without moving on from the prefix.

So you want to match each one of the GUIDs as long as they are in a string that starts with `/transfer/packages/`, yeah? — sp00m, Apr 15 '20 at 08:59
You're right. (Later on in the same string there may be others url="/transfer/package/..." and they must be captured as well, but I don't think this is relevant because if so it is entirely another piece of the string) — edoedoedo, Apr 15 '20 at 09:01
What app are you using? Btw, `-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}` is better written `(?:-[A-Za-z0-9]{4}){3}` maybe? — JvdV, Apr 15 '20 at 09:04
And instead of trying to do this in one go, maybe use two steps. First check if your string starts with your requirement, then use a pattern like `[A-Za-z0-9]{8}(?:-[A-Za-z0-9]{4}){3}-[A-Za-z0-9]{12}` to find all substrings of interest. — JvdV, Apr 15 '20 at 09:15
Yes, that solves the problem indeed. However, I'm curious whether this can be done in a single pass! — edoedoedo, Apr 15 '20 at 10:20
What is the language? Sure it can be done, but not in all languages. — Wiktor Stribiżew, Apr 15 '20 at 11:07
I'm using the `re` library in python3.6, however shouldn't this be language independent? — edoedoedo, Apr 15 '20 at 11:34

JvdV · Answer 1 · 2020-04-15T13:39:55.597

2

If you are using the re module in Python, then maybe use str.startswith and try:

import re
url="/transfer/packages/00000000-0000-0000-0000-000000000000/connectors/68f74d66-ca3d-4272-9b59-4f737946b3f7/something/138bb190-3b12-4855-88e2-0d1cdf46aeb5/...../...../...../...../...."
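# Default to an empty list so the print below works even when the prefix is absent
guid_list = []
if url.startswith('/transfer/packages/'):
    # Collect every GUID-shaped substring (case-insensitive)
    guid_list = re.findall(r'(?i)[a-z0-9]{8}(?:-[a-z0-9]{4}){3}-[a-z0-9]{12}', url)
print(guid_list)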

edited Apr 15 '20 at 13:39

answered Apr 15 '20 at 12:35

JvdV


Thank you, however the `url=""` is itself part of the string to search into, I guess that's not clear from my original post, I'll edit it. – edoedoedo Apr 15 '20 at 12:40

score 1 · Answer 2 · answered Apr 15 '20 at 11:12

1

From this SO answer:

As for the second question, it is a common problem. It is not possible to get an arbitrary number of captures with a PCRE regex, as in case of repeated captures only the last captured value is stored in the group buffer. You cannot have more submatches in the resulting array than the number of capturing groups inside the regex pattern. See Repeating a Capturing Group vs. Capturing a Repeated Group for more details.
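Python's re module behaves the same way as described in the quote: a repeated capturing group keeps only its last iteration. A minimal sketch to illustrate this, where the shortened URL and the names GUID, repeated, last_only and all_guids are assumptions made up for the example:

```python
import re

# GUID-shaped token, written without capturing groups (name is illustrative)
GUID = r"[A-Za-z0-9]{8}(?:-[A-Za-z0-9]{4}){3}-[A-Za-z0-9]{12}"

# Shortened, hypothetical input in the same shape as the question's string
text = ('url="/transfer/packages/00000000-0000-0000-0000-000000000000'
        '/connectors/68f74d66-ca3d-4272-9b59-4f737946b3f7/x"')

# One capturing group inside a repeated group: every iteration
# overwrites what the previous iteration captured.
repeated = re.search(r'/transfer/packages/(?:[^"]*?(' + GUID + r'))+', text)
last_only = repeated.group(1)

# By contrast, findall with a group-free pattern returns one item per match.
all_guids = re.findall(GUID, text)

print(last_only)
print(all_guids)
```

The single match object exposes only the second GUID through group 1, while findall returns both, which is why the answers in this thread either split the work into two steps or switch to another engine.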

answered Apr 15 '20 at 11:12

samthegolden


I understand. So the only generic way would be to do it in two steps, right? – edoedoedo Apr 15 '20 at 12:37
I don't know if that could be accomplished with two steps. The recursion can be done with two steps, but in PCRE it only captures a group as a whole and does not split them. – samthegolden Apr 15 '20 at 15:32
Thanks, very interesting to know. – edoedoedo Apr 16 '20 at 08:07

The fourth bird · Accepted Answer · 2020-04-15T13:45:57.410

You could make use of the PyPI regex module, which supports variable-length (unbounded) quantifiers in a lookbehind:

(?<=url="/transfer/packages/[^\r\n"]*)[A-Za-z0-9]{8}-(?:[A-Za-z0-9]{4}-){3}[A-Za-z0-9]{12}(?=[^\r\n"]*")

Example Regex demo (with another engine selected for demo purpose) or see a Python demo
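As a rough sketch of how the lookbehind pattern above might be used from Python: this assumes the third-party regex package is installed (pip install regex), and the sample text, including the second url="/other/..." attribute, is invented for illustration. The import is guarded so the snippet still runs where the package is missing.

```python
try:
    import regex  # third-party package, not the standard library's re
except ImportError:
    regex = None

# Hypothetical single-line input with one matching and one non-matching URL
text = ('something url="/transfer/packages/00000000-0000-0000-0000-000000000000'
        '/connectors/68f74d66-ca3d-4272-9b59-4f737946b3f7/something" '
        'other url="/other/11111111-1111-1111-1111-111111111111" end')

# The lookbehind requires the prefix earlier in the same quoted value;
# the lookahead requires a closing quote later in that value.
pattern = (r'(?<=url="/transfer/packages/[^\r\n"]*)'
           r'[A-Za-z0-9]{8}-(?:[A-Za-z0-9]{4}-){3}[A-Za-z0-9]{12}'
           r'(?=[^\r\n"]*")')

guids = regex.findall(pattern, text) if regex is not None else []
print(guids)
```

Because `[^\r\n"]*` cannot cross a double quote, the GUID inside url="/other/..." is not reported. The standard re module rejects this pattern, since it only accepts fixed-width lookbehinds.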

Another option is to first match the quoted string that starts with "/transfer/packages/ followed by a GUID, and match until the next double quote.

Then you could use for example re.findall to get all the guids.

"/transfer/packages/[A-Za-z0-9]{8}-(?:[A-Za-z0-9]{4}-){3}[A-Za-z0-9]{12}[^"\r\n]*"

Regex demo | Python demo

For example:

import re

regex = r'"/transfer/packages/[A-Za-z0-9]{8}-(?:[A-Za-z0-9]{4}-){3}[A-Za-z0-9]{12}[^"\r\n]*"'
test_str = ("something .... something else ...\n"
    "url=\"/transfer/packages/00000000-0000-0000-0000-000000000000/connectors/68f74d66-ca3d-4272-9b59-4f737946b3f7/something/138bb190-3b12-4855-88e2-0d1cdf46aeb5/...../...../...../...../....\"\n"
    "other things ...\n\n"
    "68f74d66-ca3d-4272-9b59-4f737946b300")

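# First find each quoted URL with the required prefix, then pull every GUID out of it
for quoted_url in re.findall(regex, test_str):
    print(re.findall(r"[A-Za-z0-9]{8}-(?:[A-Za-z0-9]{4}-){3}[A-Za-z0-9]{12}", quoted_url))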

Output

['00000000-0000-0000-0000-000000000000', '68f74d66-ca3d-4272-9b59-4f737946b3f7', '138bb190-3b12-4855-88e2-0d1cdf46aeb5']

I've been wrestling with the `regex` module for a good hour now. Couldn't crack it. Well done. I kept looking at `\G` options but I am too inexperienced to make that work. + — JvdV, Apr 15 '20 at 13:42
@JvdV If you want to use the `\G` you have to find a way to get continuous matches using the position at the end of the previous match. One option could be https://regex101.com/r/Yme3VZ/1. Chances are that this pattern can be simplified :-) — The fourth bird, Apr 15 '20 at 14:15
