0

I have this kind of strings:

string = '
something .... something else ...
url="/transfer/packages/00000000-0000-0000-0000-000000000000/connectors/68f74d66-ca3d-4272-9b59-4f737946b3f7/something/138bb190-3b12-4855-88e2-0d1cdf46aeb5/...../...../...../...../...."
other things ...
'

without any CR/LF, it is all on one line.

I want to create a regex which:

  • applies if and only if the url starts with /transfer/packages/
  • captures each subsequent GUID
  • stops at the end of the quoted string (the closing ")
  • handles an unknown number of GUIDs (at least one)

So far I wrote:

\/transfer\/packages\/[^"]*([A-Za-z0-9]{8}-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}-[A-Za-z0-9]{12})"

but it only captures the LAST GUID. I somehow need to reuse the prefix /transfer/packages/ and keep matching, eagerly expanding the search each time, without moving past the prefix.

edoedoedo
  • So you want to match each one of the GUIDs as long as they are in a string that starts with `/transfer/packages/`, yeah? – sp00m Apr 15 '20 at 08:59
  • You're right. (Later on in the same string there may be others url="/transfer/package/..." and they must be captured as well, but I don't think this is relevant because if so it is entirely another piece of the string) – edoedoedo Apr 15 '20 at 09:01
  • What app are you using? Btw, `-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}-[A-Za-z0-9]{4}` is better written `(?:-[A-Za-z0-9]{4}){3}` maybe? – JvdV Apr 15 '20 at 09:04
  • And instead of trying to do this in one go, maybe use two steps. First check if your string starts with your requirement, then use a pattern like `[A-Za-z0-9]{8}(?:-[A-Za-z0-9]{4}){3}-[A-Za-z0-9]{12}` to find all substrings of interest. – JvdV Apr 15 '20 at 09:15
  • Yes, that solves the problem indeed. However, I'm curious whether this can be done in a single pass! – edoedoedo Apr 15 '20 at 10:20
  • What is the language? Sure it can be done, but not in all languages. – Wiktor Stribiżew Apr 15 '20 at 11:07
  • I'm using the `re` library in python3.6, however shouldn't this be language independent? – edoedoedo Apr 15 '20 at 11:34
  • Then it cannot. – Wiktor Stribiżew Apr 15 '20 at 16:38

3 Answers

2

If you are using the re module in Python, then maybe use str.startswith and try:

import re

url = "/transfer/packages/00000000-0000-0000-0000-000000000000/connectors/68f74d66-ca3d-4272-9b59-4f737946b3f7/something/138bb190-3b12-4855-88e2-0d1cdf46aeb5/...../...../...../...../...."
guid_list = []  # stays empty when the prefix check fails
if url.startswith('/transfer/packages/'):
    # Case-insensitive match of the 8-4-4-4-12 GUID layout
    guid_list = re.findall(r'(?i)[a-z0-9]{8}(?:-[a-z0-9]{4}){3}-[a-z0-9]{12}', url)
print(guid_list)
JvdV
  • Thank you, however the `url=""` is itself part of the string to search into, I guess that's not clear from my original post, I'll edit it. – edoedoedo Apr 15 '20 at 12:40
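Since the `url="..."` attribute is itself part of the larger string, the same two-step idea could be adapted roughly as follows. This is only a sketch using the standard `re` module: the helper name `guids_per_url`, the constant names, and the shortened sample text are assumptions made for illustration, not part of the original post.

```python
import re

# Same GUID layout as above: 8-4-4-4-12 alphanumeric characters, case-insensitive.
GUID_RE = re.compile(r'(?i)[a-z0-9]{8}(?:-[a-z0-9]{4}){3}-[a-z0-9]{12}')
# Captures the value of every url="..." attribute found in the text.
URL_RE = re.compile(r'url="([^"]*)"')
PREFIX = '/transfer/packages/'


def guids_per_url(text):
    """Return one list of GUIDs per url="..." value that starts with PREFIX."""
    result = []
    for value in URL_RE.findall(text):
        if value.startswith(PREFIX):
            result.append(GUID_RE.findall(value))
    return result


# Hypothetical sample: one qualifying url and one that must be ignored.
sample = ('something url="/transfer/packages/00000000-0000-0000-0000-000000000000'
          '/connectors/68f74d66-ca3d-4272-9b59-4f737946b3f7/something" other '
          'url="/elsewhere/138bb190-3b12-4855-88e2-0d1cdf46aeb5" things')
print(guids_per_url(sample))
```

Returning one list per url keeps the GUIDs of separate `url="/transfer/packages/..."` attributes apart, which matters if the string contains several of them.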
1

From this SO answer:

As for the second question, it is a common problem. It is not possible to get an arbitrary number of captures with a PCRE regex, as in case of repeated captures only the last captured value is stored in the group buffer. You cannot have more submatches in the resulting array than the number of capturing groups inside the regex pattern. See Repeating a Capturing Group vs. Capturing a Repeated Group for more details.
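The effect described in the quote can be observed with Python's `re` module too. The following is a small sketch, assuming a shortened version of the sample url; the variable names are made up for the illustration. The capturing group sits inside a repetition, so the match object still exposes a single group, and it holds only the GUID from the last iteration that captured one.

```python
import re

GUID = r"[A-Za-z0-9]{8}-(?:[A-Za-z0-9]{4}-){3}[A-Za-z0-9]{12}"

# The capturing group is repeated: every iteration that matches a GUID
# overwrites what the group captured in an earlier iteration.
repeated = re.compile(r'url="/transfer/packages/(?:(' + GUID + r')|[^"])*"')

# Shortened, hypothetical sample with two GUIDs.
text = ('url="/transfer/packages/00000000-0000-0000-0000-000000000000'
        '/connectors/68f74d66-ca3d-4272-9b59-4f737946b3f7/x"')

match = repeated.search(text)
print(match.groups())    # a 1-tuple: one capturing group, one stored value

# Repeated *matching* (instead of a repeated group) returns every GUID.
all_guids = re.findall(GUID, text)
print(all_guids)
```

This is why the answers here either run a second `findall` over the matched url or rely on an engine feature that lets each GUID be its own match.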

samthegolden
1

You could make use of the PyPI regex module, which supports unbounded (variable-length) quantifiers inside a lookbehind:

(?<=url="/transfer/packages/[^\r\n"]*)[A-Za-z0-9]{8}-(?:[A-Za-z0-9]{4}-){3}[A-Za-z0-9]{12}(?=[^\r\n"]*")

Example Regex demo (with another engine selected for demo purpose) or see a Python demo
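As a rough sketch of how the lookbehind pattern above might be used from Python: this assumes the third-party `regex` package is installed (`pip install regex`), and the function name `guids_single_pass` and the shortened sample string are made up for the illustration. The import is guarded only so the snippet can still be loaded where the package is missing.

```python
try:
    import regex  # third-party package, not the standard-library re module
except ImportError:
    regex = None

GUID = r"[A-Za-z0-9]{8}-(?:[A-Za-z0-9]{4}-){3}[A-Za-z0-9]{12}"
# Lookbehind: the GUID must be preceded by url="/transfer/packages/ with no
# double quote in between. Lookahead: a closing double quote must follow.
PATTERN = r'(?<=url="/transfer/packages/[^\r\n"]*)' + GUID + r'(?=[^\r\n"]*")'


def guids_single_pass(text):
    """Return every GUID inside a url="/transfer/packages/..." value."""
    if regex is None:
        raise RuntimeError("the third-party 'regex' module is required")
    return regex.findall(PATTERN, text)


# Hypothetical sample: the trailing GUID is outside the quoted url.
sample = ('pre url="/transfer/packages/00000000-0000-0000-0000-000000000000'
          '/connectors/68f74d66-ca3d-4272-9b59-4f737946b3f7/x" post '
          '138bb190-3b12-4855-88e2-0d1cdf46aeb5')

if regex is not None:
    print(guids_single_pass(sample))
```

The standard `re` module rejects this pattern because it requires fixed-width lookbehinds, which is why the single-pass variant depends on the `regex` package.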


Another option is to first match the part that has url="/transfer/packages/ followed by a GUID, and match until the next double quote.

Then you could use for example re.findall to get all the guids.

"/transfer/packages/[A-Za-z0-9]{8}-(?:[A-Za-z0-9]{4}-){3}[A-Za-z0-9]{12}[^"\r\n]*"

Regex demo | Python demo

For example:

import re

# Step 1: match each quoted string that starts with /transfer/packages/<guid>
regex = r'"/transfer/packages/[A-Za-z0-9]{8}-(?:[A-Za-z0-9]{4}-){3}[A-Za-z0-9]{12}[^"\r\n]*"'
test_str = ("something .... something else ...\n"
    "url=\"/transfer/packages/00000000-0000-0000-0000-000000000000/connectors/68f74d66-ca3d-4272-9b59-4f737946b3f7/something/138bb190-3b12-4855-88e2-0d1cdf46aeb5/...../...../...../...../....\"\n"
    "other things ...\n\n"
    "68f74d66-ca3d-4272-9b59-4f737946b300")

# Step 2: extract every GUID from each matched string
# (the loop variable is not named `str`, to avoid shadowing the builtin)
for quoted_url in re.findall(regex, test_str):
    print(re.findall(r"[A-Za-z0-9]{8}-(?:[A-Za-z0-9]{4}-){3}[A-Za-z0-9]{12}", quoted_url))

Output

['00000000-0000-0000-0000-000000000000', '68f74d66-ca3d-4272-9b59-4f737946b3f7', '138bb190-3b12-4855-88e2-0d1cdf46aeb5']
The fourth bird
  • I've been wrestling with the `regex` module for a good hour now. Couldn't crack it. Well done. I kept looking at `\G` options but I am too inexperienced to make that work. + – JvdV Apr 15 '20 at 13:42
  • @JvdV If you want to use the `\G` you have to find a way to get continuous matches using the position at the end of the previous match. One option could be https://regex101.com/r/Yme3VZ/1. Chances are that this pattern can be simplified :-) – The fourth bird Apr 15 '20 at 14:15