
I want to extract UUIDs from URLs.

For example:

/posts/eb8c6d25-8784-4cdf-b016-4d8f6df64a62?mc_cid=37387dcb5f&mc_eid=787bbeceb2
/posts/d78fa5da-4cbb-43b5-9fae-2b5c86f883cb/uid/7034
/posts/5ff0021c-16cd-4f66-8881-ee28197ed1cf

I have thousands of strings like these.

My current regex is ".*\/posts\/(.*)[/?]+.*", which gives results like this:

d78fa5da-4cbb-43b5-9fae-2b5c86f883cb/uid
84ba0472-926d-4f50-b3c6-46376b2fe9de/uid
6f3c97c1-b877-40e0-9479-6bdb826b7b8f/uid
f5e5dc6a-f42b-47d1-8ab1-6ae533415d24
f5e5dc6a-f42b-47d1-8ab1-6ae533415d24
f7842dce-73a3-4984-bbb0-21d7ebce1749
fdc6c48f-b124-447d-b4fc-bb528abb8e24

As you can see, my regex handles the ?xxxx query parameters fine, but it can't get rid of /uid.

What did I miss? How do I fix it?

Thanks

Lucas Shen
  • Did you try searching first? http://stackoverflow.com/questions/136505/searching-for-uuids-in-text-with-regex and http://stackoverflow.com/questions/7905929/how-to-test-valid-uuid-guid – fukanchik May 18 '16 at 21:46
  • Good pointers. I searched with the wrong keywords. @fukanchik – Lucas Shen May 18 '16 at 21:49

2 Answers


The .* pattern is too broad and greedy for a UUID:

>>> import re
>>> data = """
... /posts/eb8c6d25-8784-4cdf-b016-4d8f6df64a62?mc_cid=37387dcb5f&mc_eid=787bbeceb2
... /posts/d78fa5da-4cbb-43b5-9fae-2b5c86f883cb/uid/7034
... /posts/5ff0021c-16cd-4f66-8881-ee28197ed1cf
... """
>>> 
>>> re.findall(r"/posts/([A-Za-z0-9\-]+)", data)
['eb8c6d25-8784-4cdf-b016-4d8f6df64a62', 
 'd78fa5da-4cbb-43b5-9fae-2b5c86f883cb', 
 '5ff0021c-16cd-4f66-8881-ee28197ed1cf']

Or you can be stricter about the UUID format and match only the 8-4-4-4-12 layout of hexadecimal digits.
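A stricter pattern could look like the sketch below. It assumes the UUIDs are written in the canonical 8-4-4-4-12 hexadecimal form shown in the question; the names UUID_IN_POST_URL and extract_uuid are made up for this illustration and are not part of the original answer.

```python
import re
from typing import Optional

# Canonical textual UUID: 8-4-4-4-12 hexadecimal digits separated by hyphens.
# Assumption: every UUID in the URLs follows this layout.
UUID_IN_POST_URL = re.compile(
    r"/posts/([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})"
)


def extract_uuid(url: str) -> Optional[str]:
    """Return the UUID that follows /posts/, or None if there is none."""
    match = UUID_IN_POST_URL.search(url)
    return match.group(1) if match else None


urls = [
    "/posts/eb8c6d25-8784-4cdf-b016-4d8f6df64a62?mc_cid=37387dcb5f&mc_eid=787bbeceb2",
    "/posts/d78fa5da-4cbb-43b5-9fae-2b5c86f883cb/uid/7034",
    "/posts/5ff0021c-16cd-4f66-8881-ee28197ed1cf",
]

for url in urls:
    print(extract_uuid(url))
```

Because the pattern stops after exactly 12 hex digits in the last group, anything that follows (/uid/7034, a query string, or nothing at all) is simply ignored.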

alecxe

Regular expression quantifiers such as * and + are greedy by default: they try to match as many characters as possible (informally called "maximal munch").

A plain-English description of your regex .*\/posts\/(.*)[/?]+.* would be something like:

Match anything, followed by /posts/, followed by anything, followed by one or more / or ? characters, followed by anything.

When we apply that regex to this text:

.../posts/d78fa5da-4cbb-43b5-9fae-2b5c86f883cb/uid/7034

... the maximal munch rule demands that the second "anything" match be as long as possible, therefore it ends up matching more than you wanted:

d78fa5da-4cbb-43b5-9fae-2b5c86f883cb/uid

... because the remaining /7034 still satisfies the rest of the regex: the / matches [/?]+ and 7034 matches the final .*.
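The greedy behaviour described above can be checked directly. The sketch below runs the regex from the question against that URL, then tries a lazy variant, (.*?), for comparison. The lazy variant is an illustration added here, not something proposed in the question, and the variable names are arbitrary.

```python
import re

url = "/posts/d78fa5da-4cbb-43b5-9fae-2b5c86f883cb/uid/7034"

# Greedy group: (.*) grabs as much as it can and only backtracks
# to the LAST "/" so that [/?]+ can still match.
greedy = re.match(r".*/posts/(.*)[/?]+.*", url).group(1)

# Lazy group: (.*?) grows one character at a time and stops
# at the FIRST "/" or "?" after the UUID.
lazy = re.match(r".*/posts/(.*?)[/?]+.*", url).group(1)

print(greedy)  # includes the unwanted "/uid" suffix
print(lazy)    # the UUID only

# Both variants still require a "/" or "?" after the UUID,
# so a URL that ends right after the UUID does not match at all.
bare_url = "/posts/5ff0021c-16cd-4f66-8881-ee28197ed1cf"
no_match = re.match(r".*/posts/(.*?)[/?]+.*", bare_url)
print(no_match)
```

Making the group lazy removes the /uid suffix, but the pattern still fails on URLs with nothing after the UUID, which is one reason to prefer a pattern that describes the UUID itself.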

The best way to fix it is to use a regex that matches only characters that can actually occur in a UUID (as suggested by @alecxe).

John Gordon