-1

I'm trying to combine if/else logic inside my regular expression: basically, if some pattern exists in the string, capture one pattern; if not, capture another.

The string is 'https://www.searchpage.com/searchcompany.aspx?companyId=41490234&page=0&leftlink=true' and I want to extract the stuff around the '?'.

So if '?' is detected inside the string, the regular expression should capture everything after the '?' mark; if not, then just capture from the beginning.

I used '(.*\?.*)?(\?.*&.*)|(^&.*)', but it didn't work...

Any suggestions?

Thanks!

Peter Wood
JudyJiang
  • If you can guarantee that there won't be any other question marks later, you could use something like `r".*?\??([^?]+)"`. – Tom Hunt Feb 19 '15 at 22:18
  • Thanks for the reply. But this still captures the 'https://www.search..' part. I only want that part captured when no question mark is detected. – JudyJiang Feb 19 '15 at 22:20
  • 3
    Why not use [`urlparse`](https://docs.python.org/2/library/urlparse.html)? It allows you to get all the parts of the URL. – Peter Wood Feb 19 '15 at 22:21
  • possible duplicate of [Best way to parse a URL query string](http://stackoverflow.com/questions/10113090/best-way-to-parse-a-url-query-string) – Peter Wood Feb 20 '15 at 09:40

3 Answers

5

Use urlparse:

>>> import urlparse  # Python 2 module name
>>> parse_result = urlparse.urlparse(
...     'https://www.searchpage.com/searchcompany.aspx?'
...     'companyId=41490234&page=0&leftlink=true')

>>> parse_result
ParseResult(scheme='https', netloc='www.searchpage.com', 
path='/searchcompany.aspx', params='', 
query='companyId=41490234&page=0&leftlink=true', fragment='')

>>> urlparse.parse_qs(parse_result.query)
{'leftlink': ['true'], 'page': ['0'], 'companyId': ['41490234']}

The last line is a dictionary of key/value pairs.
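On Python 3 the same functions live in `urllib.parse`. A minimal sketch of the equivalent calls, assuming Python 3 and the URL from the question (the variable names `url`, `parts`, `query` and `params` are only illustrative):

```python
from urllib.parse import urlparse, parse_qs

# URL taken from the question, split over two literals for readability.
url = ('https://www.searchpage.com/searchcompany.aspx?'
       'companyId=41490234&page=0&leftlink=true')

parts = urlparse(url)
query = parts.query        # text after the '?', or '' when there is none
params = parse_qs(query)   # dict mapping each key to a list of values

print(parts.path)
print(query)
print(params)
```

Note that `parts.query` is an empty string when the input has no '?', so this does not by itself fall back to the whole string.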

Peter Wood
4

Regex might not be the best solution to this problem... why not just

my_url.split("?",1)

if that is truly all you wish to do
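A small sketch of how the split could give the behaviour asked for in the question, assuming Python 3; the helper name `after_question_mark` is made up for illustration. Taking the last element of the split result returns the text after the first '?' when there is one, and the whole string otherwise:

```python
def after_question_mark(text):
    """Return the part after the first '?', or the whole string if there is none."""
    # split("?", 1) yields one element when there is no '?' and two when
    # there is at least one; the last element is the wanted part either way.
    return text.split("?", 1)[-1]


print(after_question_mark('https://www.searchpage.com/searchcompany.aspx?companyId=41490234&page=0&leftlink=true'))
print(after_question_mark('/company/Analytics/GetService'))
```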

or as others have suggested

from urlparse import urlparse
print urlparse(my_url)
Joran Beasley
  • Because I want to parse and extract parts not only from URLs but also from queries and paths. So there are URL strings as above, but also path strings such as '/company/Analytics/GetService' and query strings such as 'companyId=4343&type=0&page=11'. – JudyJiang Feb 19 '15 at 22:26
2

This regex:

(^[^?]*$|(?<=\?).*)

captures:

  • ^[^?]*$ everything, if there's no ?, or
  • (?<=\?).* everything after the ?, if there is one
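As a quick check of the two alternatives, here is a sketch that applies the pattern with `re.search`, assuming Python 3; the names `PATTERN` and `extract` are hypothetical and only used for this example:

```python
import re

# The pattern from this answer: either the whole string when it contains
# no '?', or everything that follows the first '?'.
PATTERN = re.compile(r'(^[^?]*$|(?<=\?).*)')


def extract(text):
    """Return the captured group, or None if the pattern does not match."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


print(extract('https://www.searchpage.com/searchcompany.aspx?companyId=41490234&page=0&leftlink=true'))
print(extract('/company/Analytics/GetService'))
```

`re.search` is used rather than `re.match` because the lookbehind alternative can only succeed at a position after the '?', not at the start of the string.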

However, you should look into urllib.parse (Python 3) or urlparse (Python 2) if you're working with URLs.

Zero Piraeus