I want to regex a list of URLs.
The links format looks like this:
`https://en.wikipedia.org/wiki/Alexander_Pushkin'
The part I need:
en.wikipedia.org
Can you help, please?
Instead of looking for \w
etc. which would only match the domain, you're effectively looking for anything up to where the URL arguments start (the first ?
):
re.search(r'[^?]*', URL)
This means: from the beginning of the string (search
), all characters that are not ?
. A character class beginning with ^
negates the class, i.e. not matching instead of matching.
This gives you a match object, where [0]
will be the URL you're looking for.
You can do that wihtout using regex by leveraging urllib.parse.urlparse
from urllib.parse import urlparse
url = "https://sales-office.ae/axcapital/damaclagoons/?cm_id=14981686043_130222322842_553881409427_kwd-1434230410787_m__g_&gclid=Cj0KCQiAxc6PBhCEARIsAH8Hff2k3IHDPpViVTzUfxx4NRD-fSsfWkCDT-ywLPY2C6OrdTP36x431QsaAt2dEALw_wcB"
parsed_url = urlparse(url)
print(f"{parsed_url.scheme}://{parsed_url.netloc}{parsed_url.path}")
Outputs
Based on: How to match "anything up until this sequence of characters" in a regular expression?
.+?(?=\?)
so:
re.findall(".+?(?=\?)", URL)