Regex not matching URL in Python

Question

Possible Duplicate:
how to extract domain name from URL

I want to extract the host name from a URL, for example console.aws.amazon.com from the following URL.

>>> ts
'https://console.aws.amazon.com/ec2/home?region=us-east-1#s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806'
>>> re.match(ts,'(")?http(s)?://(.*?)/').group(0)

Traceback (most recent call last):
  File "<pyshell#17>", line 1, in <module>
    re.match(ts,'(")?http(s)?://(.*?)/').group(0)
AttributeError: 'NoneType' object has no attribute 'group'

I tried this regular expression in JS and it worked. Any idea why this matches in JS, but it doesn't work in Python?

Regex or regexp if you like, but not regrex. Short for _Reg_ular _Ex_pression. – dschulz, Jan 09 '13 at 02:29
Vote for reopen - as this specific question is asking for a regular expression to extract the domain. The comment below the answer clarifies why urlparse is not ideal *in this case* - namely that an exe will be exported, and the less includes the better. — Josh Smeaton, Jan 10 '13 at 01:04

score 5 · Accepted Answer · answered Jan 09 '13 at 02:28

You are passing the arguments to re.match in the wrong order. The Python documentation gives the signature as:

re.match(pattern, string, flags=0)

You are calling it as:

re.match(string, pattern)

So simply change it to:

 re.match('(")?http(s)?://(.*?)/', ts).group(0)
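Note that group(0) is the whole matched text, scheme and trailing slash included; the host itself ends up in the third capture group. A minimal sketch of the corrected call, assuming the same ts as in the question (the names URL_RE, whole and host are illustrative and not from the original post):

```python
import re

ts = ('https://console.aws.amazon.com/ec2/home?region=us-east-1'
      '#s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806')

# Pattern first, string second: re.match(pattern, string).
URL_RE = re.compile(r'(")?http(s)?://(.*?)/')

m = URL_RE.match(ts)

# re.match returns None when nothing matches, so check before calling .group().
whole = m.group(0) if m else None  # entire match
host = m.group(3) if m else None   # third capture group: the host

print(whole)  # https://console.aws.amazon.com/
print(host)   # console.aws.amazon.com
```

Checking for None before calling .group() avoids the AttributeError from the question when a string does not match.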

Ruben


OK, that's the root cause. :) – Shawn Zhang Jan 09 '13 at 02:35
Glad you solved it ;) Although using existing tools like the peeps are suggesting below is definitely something you should look at. Don't write stuff yourself if it already exists ;) – Ruben Jan 09 '13 at 03:14
Why are you encouraging it then if you're recommending "don't write stuff yourself if it already exists"? – hd1 Jan 09 '13 at 03:23
Because it is a solution to the problem. The other answers are alternatives (not solutions) for the problem Shawn is having. – Ruben Jan 09 '13 at 03:34
while it is **a** solution, @ShawnZhang should be using urlparse, which is intended for **precisely** this purpose, instead of going through some convoluted regexp developed by a random internet user. – hd1 Jan 09 '13 at 07:31
Hi, both. First of all, thanks for your kindness and critical thinking about how to apply Python best practice. For URL parsing, urlparse is great, but I'm trying to export to an executable file that needs to stay small, and re is already used elsewhere in the code, so in this case re is the better solution. Thanks to you all @hd12 – Shawn Zhang Jan 09 '13 at 07:43

score 5 · Answer 2 · answered Jan 09 '13 at 02:32

Use urlparse

>>> from urlparse import urlparse
>>> u = 'https://console.aws.amazon.com/ec2/home?region=us-east-1#s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806'
>>> p = urlparse(u)
>>> p
ParseResult(scheme='https', netloc='console.aws.amazon.com', path='/ec2/home', params='', query='region=us-east-1', fragment='s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806')
>>> p.netloc
'console.aws.amazon.com'
>>>
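The session above is Python 2. In Python 3 the same function lives in urllib.parse. A minimal sketch, assuming Python 3 and the same URL as above:

```python
from urllib.parse import urlparse  # Python 3 home of urlparse

u = ('https://console.aws.amazon.com/ec2/home?region=us-east-1'
     '#s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806')

p = urlparse(u)

# netloc is the raw "host[:port]" part; hostname is the lowercased host
# with any port removed.
print(p.netloc)    # console.aws.amazon.com
print(p.hostname)  # console.aws.amazon.com
```

For this URL the two attributes agree; they differ only when the URL carries a port, credentials, or upper-case letters in the host.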

score 0 · Answer 3 · answered Jan 09 '13 at 02:25

You could always use the str.partition method for this:

>>> print(ts.partition('//')[2].partition('/')[0])
console.aws.amazon.com

Regular expressions are a bit of overkill for this.
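One caveat: if the string has no '//', the first partition leaves nothing to split and the one-liner silently yields an empty string. A small sketch that makes this explicit, assuming an empty result is acceptable for such input (the helper name host_from_url is hypothetical, not from the original answer):

```python
def host_from_url(url):
    """Return the text between '//' and the next '/', or '' if there is no '//'."""
    _, sep, rest = url.partition('//')
    if not sep:
        # No scheme separator found, so there is nothing to extract.
        return ''
    # Everything up to the first '/' after the separator (includes any port).
    return rest.partition('/')[0]


ts = ('https://console.aws.amazon.com/ec2/home?region=us-east-1'
      '#s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806')

print(host_from_url(ts))  # console.aws.amazon.com
```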

Volatility


Even your solution is *a bit overkill* as the urlparse module exists for **precisely** this purpose. – hd1 Jan 09 '13 at 03:24
