0

Possible Duplicate:
how to extract domain name from URL

I want to extract the website from a URL, e.g. console.aws.amazon.com from the following URL.

>>> ts
'https://console.aws.amazon.com/ec2/home?region=us-east-1#s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806'
>>> re.match(ts,'(")?http(s)?://(.*?)/').group(0)

Traceback (most recent call last):
File "<pyshell#17>", line 1, in <module>
re.match(ts,'(")?http(s)?://(.*?)/').group(0)
AttributeError: 'NoneType' object has no attribute 'group'

I tried this regular expression in JS and it worked. Any idea why this matches in JS, but it doesn't work in Python?

Community
Shawn Zhang
  • Regex or regexp if you like, but not regrex. Short for *Reg*ular *Ex*pression. – dschulz Jan 09 '13 at 02:29
  • Vote for reopen - as this specific question is asking for a regular expression to extract the domain. The comment below the answer clarifies why urlparse is not ideal *in this case* - namely that an exe will be exported, and the less includes the better. – Josh Smeaton Jan 10 '13 at 01:04

3 Answers

5

You are calling re.match with its arguments in the wrong order. The Python docs give the signature as:

re.match(pattern, string, flags=0)

You are doing:

re.match(string, pattern)

So simply change it to:

 re.match('(")?http(s)?://(.*?)/', ts).group(0)
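A minimal sketch of the corrected call, assuming the same ts string as in the question. Note that group(0) is the whole match (scheme and trailing slash included), while the third capture group holds just the host. The names m, host and whole are illustrative only, and the None check is an addition that avoids the AttributeError when nothing matches.

```python
import re

ts = ('https://console.aws.amazon.com/ec2/home?region=us-east-1'
      '#s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806')

# Pattern first, then the string to search.
m = re.match(r'(")?http(s)?://(.*?)/', ts)

# re.match returns None when there is no match, so check before calling .group().
whole = m.group(0) if m else None  # entire matched text
host = m.group(3) if m else None   # third capture group: text between '://' and the next '/'

print(whole)  # https://console.aws.amazon.com/
print(host)   # console.aws.amazon.com
```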
Ruben
  • OK, that's the root cause. :) – Shawn Zhang Jan 09 '13 at 02:35
  • Glad you solved it ;) Although using existing tools like the peeps are suggesting below is definitely something you should look at. Don't write stuff yourself if it already exists ;) – Ruben Jan 09 '13 at 03:14
  • Why are you encouraging it then if you're recommending "don't write stuff yourself if it already exists"? – hd1 Jan 09 '13 at 03:23
  • Because it is a solution to the problem. The other answers are alternatives (not solutions) for the problem Shawn is having. – Ruben Jan 09 '13 at 03:34
  • while it is **a** solution, @ShawnZhang should be using urlparse, which is intended for **precisely** this purpose, instead of going through some convoluted regexp developed by a random internet user. – hd1 Jan 09 '13 at 07:31
  • Hi, both. First of all, thanks for your kindness and critical thinking about how to apply Python best practices. For URL parsing, urlparse is great, but I'm trying to export to an executable file, which needs to be small, and re is already used elsewhere in the code, so in this case re is the better solution. Thanks to you all @hd1 – Shawn Zhang Jan 09 '13 at 07:43
5

Use urlparse

>>> from urlparse import urlparse
>>> u = 'https://console.aws.amazon.com/ec2/home?region=us-east-1#s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806'
>>> p = urlparse(u)
>>> p
ParseResult(scheme='https', netloc='console.aws.amazon.com', path='/ec2/home', params='', query='region=us-east-1', fragment='s=Instances,EC2 Management Console,12/3/2012 4:34:57 PM,11,0,,25806')
>>> p.netloc
'console.aws.amazon.com'
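The import above is the Python 2 spelling. In Python 3 the same function lives in urllib.parse. The sketch below uses a shortened version of the URL and a second made-up URL (both assumptions, for illustration) to show the difference between netloc, which keeps any user info and port, and hostname, which strips them and lowercases the result.

```python
from urllib.parse import urlparse  # Python 3 location of urlparse

# Shortened form of the URL from the question (assumption for brevity).
u = 'https://console.aws.amazon.com/ec2/home?region=us-east-1#s=Instances'
p = urlparse(u)
print(p.netloc)    # console.aws.amazon.com
print(p.hostname)  # console.aws.amazon.com

# Hypothetical URL with user info and a port, to show where the two differ.
q = urlparse('https://user@Example.com:8443/path')
print(q.netloc)    # user@Example.com:8443
print(q.hostname)  # example.com
```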
Josh Smeaton
0

You could always use the str.partition method for this:

>>> print(ts.partition('//')[2].partition('/')[0])
console.aws.amazon.com

Regular expressions are a bit of overkill for this.
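As a rough sketch, the one-liner can be wrapped in a small helper. The name host_from_url is hypothetical, and the fallback for URLs without '//' is an addition, since the bare one-liner returns an empty string in that case.

```python
def host_from_url(url):
    """Hypothetical helper: return the text between '//' and the next '/'."""
    _, sep, rest = url.partition('//')
    if not sep:
        # No scheme separator found: treat the whole string as host + path.
        rest = url
    return rest.partition('/')[0]


print(host_from_url('https://console.aws.amazon.com/ec2/home?region=us-east-1'))
# console.aws.amazon.com
print(host_from_url('console.aws.amazon.com/ec2/home'))
# console.aws.amazon.com
```

This only splits on '/', so a port, user info, or a query string that directly follows the host (for example 'https://example.com?x=1') stays in the result.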

Volatility
  • Even your solution is *a bit overkill* as the urlparse module exists for **precisely** this purpose. – hd1 Jan 09 '13 at 03:24