I need to extract text from strings using regular expressions. Examples:

//localhost:8000/pmp/pml/vault/
//localhost:8000/pmp/bom/vault/
//localhost:8000/pmp/parts/advancedsearch/

The port number after localhost may differ, or the front half may be a different address altogether. I need to extract everything from /pmp/ up to and including the following slash. So:

/pmp/pml/
/pmp/bom/
/pmp/parts/

What regular expression could I use to extract that text? Please also explain what each component of the regular expression does, as I am trying to learn this rather than just get the answer.

I have the following bit of regex, but it only works when the string is split after the localhost port number. Also, I don't know what any part of it means:

/[^/]*/([^/]*)/

justinxhan

4 Answers

You don't need regex for everything. Regular expressions are hard to understand and hard to maintain, and for many tasks there are better solutions.

from urllib.parse import urlparse
print(urlparse("//localhost:8000/pmp/parts/advancedsearch/"))

This code makes it clear to everyone that you are parsing a URL. The regex doesn't convey this message.

Output:

ParseResult(scheme='', netloc='localhost:8000', path='/pmp/parts/advancedsearch/', params='', query='', fragment='')

As you can see, path is what you want to process next, e.g.

from urllib.parse import urlparse
url = urlparse("//localhost:8000/pmp/parts/advancedsearch/")
dirs = url.path.split("/")
print(f"/{dirs[1]}/{dirs[2]}/")
Thomas Weller

This skips over the hostname and port number, and captures whatever /pmp/someword/ immediately follows them.

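import re

url = "//localhost:8000/pmp/parts/advancedsearch/"  # example input from the question

# ^ anchors at the start; group 1 captures /pmp/<word>/
pmp_re = re.compile(r"^//localhost:\d+(/pmp/\w+/)")
if match := pmp_re.search(url):
    print(match[1])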

The ^ caret anchor forces any match to begin at the start of the string, and the ( ) parens define capture group #1. When looking for a \d digit or a \w word character, the + insists on one or more of them.
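To make the anchor, the group, and the + quantifier concrete, here is a small sketch. The two failing inputs are made up for illustration and are not from the question.

```python
import re

pmp_re = re.compile(r"^//localhost:\d+(/pmp/\w+/)")

match = pmp_re.search("//localhost:8000/pmp/parts/advancedsearch/")
whole = match[0]   # everything the pattern consumed
group1 = match[1]  # only the part inside the parentheses
print(whole)
print(group1)

# With ^, the pattern cannot start matching in the middle of a string.
anchored_miss = pmp_re.search("see //localhost:8000/pmp/parts/")
print(anchored_miss)

# \d+ needs at least one digit, so a colon with no port does not match.
no_digits = pmp_re.search("//localhost:/pmp/parts/")
print(no_digits)
```

The first two lines printed are //localhost:8000/pmp/parts/ and /pmp/parts/, which shows why you read match[1] rather than match[0]. Both of the other searches return None.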

https://regex101.com/r/KsOaBQ/1

This regex is fragile, as port 80 might come out as localhost:80/ or just localhost/. We could make the colon and digits optional using ? for zero-or-one and * for zero-or-more matches:

pmp_re = re.compile(r"^//localhost:?\d*(/pmp/\w+/)")

But it would be better to call urlparse() and then work with the path it returns.
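A minimal sketch of that combination, assuming the inputs always look like the examples in the question. The helper name pmp_prefix and the example.com host are my own inventions for illustration.

```python
import re
from urllib.parse import urlparse

# Anchored to the start of the path, so host and port never reach the regex.
path_re = re.compile(r"^(/pmp/\w+/)")

def pmp_prefix(url):
    """Return '/pmp/<word>/' from the URL's path, or None if it is absent."""
    match = path_re.search(urlparse(url).path)
    return match[1] if match else None

examples = [
    "//localhost:8000/pmp/pml/vault/",
    "//localhost/pmp/bom/vault/",                  # no port at all
    "//example.com:80/pmp/parts/advancedsearch/",  # hypothetical other host
]
results = [pmp_prefix(u) for u in examples]
print(results)
```

Because urlparse() has already separated the netloc from the path, the regex no longer needs to know anything about localhost, colons, or port digits.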

When we study the last part of that regex, /\w+/, it's worth noting that / slash is not a \word character; it is neither alphanumeric nor _ underscore. We could use a fancy "not slash" regex of /[^/]+/, but that would be much less readable, so I recommend that if possible you should avoid going down that path. Humans can more easily read things stated in the positive than in the negative. There are also fancy "lazy" modifiers like /.+?/ that one might use, but again that won't improve the code's readability for a novice.
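For comparison, here is a rough sketch that runs the three variants side by side. The hyphenated path segment is a made-up input, included only to show where \w+ behaves differently from the other two.

```python
import re

# The same pattern with three different ways of matching the second segment.
patterns = {
    "word": re.compile(r"(/pmp/\w+/)"),
    "not_slash": re.compile(r"(/pmp/[^/]+/)"),
    "lazy": re.compile(r"(/pmp/.+?/)"),
}

def extract(pattern, url):
    match = pattern.search(url)
    return match[1] if match else None

plain = "//localhost:8000/pmp/parts/advancedsearch/"
hyphenated = "//localhost:8000/pmp/spare-parts/advancedsearch/"  # hypothetical

plain_results = {name: extract(p, plain) for name, p in patterns.items()}
hyphen_results = {name: extract(p, hyphenated) for name, p in patterns.items()}
print(plain_results)
print(hyphen_results)
```

On the plain URL all three return /pmp/parts/. On the hyphenated one, \w+ finds nothing because - is not a word character, while the other two return /pmp/spare-parts/. So the readable version is fine as long as your segment names stay alphanumeric.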

J_H

You can slightly adjust your regex attempt from /[^/]*/([^/]*)/ to //[^/]+(/[^/]+/[^/]+/).*.

  • // : Matches the two literal slashes at the start
  • [^/]+ : Matches one or more characters that are not a slash (the host and port)
  • (/[^/]+/[^/]+/) : Captures the next two path segments, including their surrounding slashes, as group 1
  • .* : Matches the rest of the string (zero or more of any character)

Test/Output :

import re

list_of_urls = [
    "//localhost:8000/foo/pml/vault/",
    "//localhost:8000/bar/bom/vault/",
    "//localhost:8000/baz/parts/advancedsearch/",
]

def get_path(url):
    m = re.search(r"//[^/]+(/[^/]+/[^/]+/).*", url)
    return m.group(1) if m else None

for url in list_of_urls:
    print(get_path(url))

/foo/pml/
/bar/bom/
/baz/parts/
Timeless

You can use the following.

(/pmp/.+?/)

The parentheses denote a capture group.

The . character matches any character except a newline.
The subsequent + is a quantifier, which attempts to match one or more of the preceding element.
And the ? is a modifier to the quantifier, specifying that it should match as little as possible, as opposed to as much as possible, which is the default.
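A quick sketch of the difference the ? makes, using one of the URLs from the question. The variable names are just for illustration.

```python
import re

url = "//localhost:8000/pmp/parts/advancedsearch/"

# Lazy: .+? stops at the first slash that lets the pattern finish.
lazy = re.search(r"(/pmp/.+?/)", url)[1]

# Greedy: .+ runs to the end and backs up only as far as the last slash.
greedy = re.search(r"(/pmp/.+/)", url)[1]

print(lazy)
print(greedy)
```

The lazy version prints /pmp/parts/, while the greedy one prints /pmp/parts/advancedsearch/.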

Reilas