Parsing URL with regex

Question

I'm trying to put if/else logic inside my regular expression: basically, if some pattern exists in the string, capture one thing; if not, capture another.

The string is 'https://www.searchpage.com/searchcompany.aspx?companyId=41490234&page=0&leftlink=true' and I want to extract the stuff around the '?'.

So if '?' is detected inside the string, the regular expression should capture everything after the '?' mark; if not, then just capture from the beginning.

I used '(.*\?.*)?(\?.*&.*)|(^&.*)', but it didn't work...

Any suggestions?

Thanks!

If you can guarantee that there won't be any other question marks later, you could use something like `r".*?\??([^?]+)"`. — Tom Hunt, Feb 19 '15 at 22:18
Thanks for the reply. But this still captures the 'https://www.search..' part, and I only want that captured when there's no question mark detected. — JudyJiang, Feb 19 '15 at 22:20
Why not use [`urlparse`](https://docs.python.org/2/library/urlparse.html)? It allows you to get all the parts of the URL. — Peter Wood, Feb 19 '15 at 22:21
possible duplicate of [Best way to parse a URL query string](http://stackoverflow.com/questions/10113090/best-way-to-parse-a-url-query-string) — Peter Wood, Feb 20 '15 at 09:40

score 5 · Answer 1 · answered Feb 19 '15 at 22:32

Use urlparse:

>>> import urlparse
>>> parse_result = urlparse.urlparse('https://www.searchpage.com/searchcompany.aspx?companyId=41490234&page=0&leftlink=true')

>>> parse_result
ParseResult(scheme='https', netloc='www.searchpage.com', 
path='/searchcompany.aspx', params='', 
query='companyId=41490234&page=0&leftlink=true', fragment='')

>>> urlparse.parse_qs(parse_result.query)
{'leftlink': ['true'], 'page': ['0'], 'companyId': ['41490234']}

The last line is a dictionary mapping each query parameter to a list of its values.
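For readers on Python 3, here is a minimal sketch of the same idea. It assumes the `urllib.parse` module, which is where `urlparse` and `parse_qs` live in Python 3; the variable names are only illustrative.

```python
from urllib.parse import urlparse, parse_qs

url = "https://www.searchpage.com/searchcompany.aspx?companyId=41490234&page=0&leftlink=true"

# Split the URL into scheme, netloc, path, query and so on.
parts = urlparse(url)

# Turn the raw query string into a dict of parameter -> list of values.
query = parse_qs(parts.query)

print(parts.path)
print(parts.query)
print(query["companyId"])
```

Note that `parse_qs` can also be called directly on a bare query string such as 'companyId=4343&type=0&page=11', without going through `urlparse` first.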

score 4 · Answer 2 · Joran Beasley · 2015-02-19T22:31:27.783

Regex might not be the best solution to this problem. Why not just

my_url.split("?",1)

if that is truly all you wish to do?

Or, as others have suggested:

from urlparse import urlparse
print urlparse(my_url)
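A minimal sketch of the `split` approach, written for Python 3. The helper name `after_question_mark` is made up for illustration; it relies on the fact that `split("?", 1)` returns a one-element list when the string contains no '?'.

```python
def after_question_mark(text):
    """Return the part after the first '?', or the whole string if there is none."""
    # split("?", 1) gives [before, after] when a '?' exists and [text] otherwise,
    # so the last element is the wanted piece in both cases.
    return text.split("?", 1)[-1]


url = "https://www.searchpage.com/searchcompany.aspx?companyId=41490234&page=0&leftlink=true"
print(after_question_mark(url))
print(after_question_mark("/company/Analytics/GetService"))
```

This only looks at the first '?', so any later question marks stay in the returned text.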

edited Feb 19 '15 at 22:31

answered Feb 19 '15 at 22:23

Joran Beasley

Because I want to parse and extract parts from not only the URL but also the query and the path. So there's the URL string as above, but also a path string like '/company/Analytics/GetService' and a query string like 'companyId=4343&type=0&page=11'. – JudyJiang Feb 19 '15 at 22:26

score 2 · Accepted Answer · answered Feb 19 '15 at 22:26

This regex:

(^[^?]*$|(?<=\?).*)

captures:

`^[^?]*$` everything, if there's no `?`, or
`(?<=\?).*` everything after the `?`, if there is one

However, you should look into urllib.parse (Python 3) or urlparse (Python 2) if you're working with URLs.
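As a quick check, the regex above can be tried against the three kinds of strings mentioned in the question and comments. This is only a sketch using Python's `re` module; `PATTERN`, `samples` and `results` are illustrative names.

```python
import re

# The pattern from this answer: the whole string if it has no '?',
# otherwise everything after the first '?'.
PATTERN = re.compile(r"(^[^?]*$|(?<=\?).*)")

samples = [
    "https://www.searchpage.com/searchcompany.aspx?companyId=41490234&page=0&leftlink=true",
    "/company/Analytics/GetService",
    "companyId=4343&type=0&page=11",
]

# search() scans left to right, so with a '?' present the first match
# starts immediately after the first '?'.
results = [PATTERN.search(s).group(1) for s in samples]

for captured in results:
    print(captured)
```

The first sample yields only the query string; the other two contain no '?' and are captured whole.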

answered Feb 19 '15 at 22:26

Zero Piraeus

yes some famous saying about regular expressions comes to mind here (+1) – Joran Beasley Feb 19 '15 at 22:32
