
I'm quite new to Python. I'm trying to parse a file of URLs to leave only the domain name.

Some of the URLs in my log file begin with http://, some begin with www., and some begin with both.

This is the part of my code which strips the http:// part. What do I need to add to it to look for both http and www. and remove both?

line = re.findall(r'(https?://\S+)', line)

Currently, when I run the code, only http:// is stripped. If I change the code to the following:

line = re.findall(r'(https?://www.\S+)', line)

Only URLs that start with both are matched. I need the pattern to treat each prefix as optional. TIA

Edit: here is my full code:

import re
import sys
from urlparse import urlparse

with open(sys.argv[1], "r") as f:
    for line in f:
        urls = re.findall(r'(https?://\S+)', line)
        if urls:
            parsed = urlparse(urls[0])
            print parsed.hostname

I mistagged my original post as regex; it is in fact using urlparse.

Paul Tricklebank
  • Just a note: You do realise that `www.domain.com` is *different* from `domain.com`, right, and may point at wildly different IPs? – J. Steen Jan 31 '13 at 12:23
  • What about the domains `www.www.com` and `www.com`? – Matthias Jan 31 '13 at 12:30
  • Duplicate: http://stackoverflow.com/questions/1521592/get-root-domain-of-link – Alex L Jan 31 '13 at 12:31
  • Duplicate: http://stackoverflow.com/questions/569137/how-to-get-domain-name-from-url I'll delete my existing post now that I can comment :) – wei2912 Mar 05 '13 at 14:42

6 Answers


It might be overkill for this specific situation, but I'd generally use urlparse.urlsplit (Python 2) or urllib.parse.urlsplit (Python 3).

try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2
import re

url = 'www.python.org'

# URLs must have a scheme
# www.python.org is an invalid URL (urlsplit would treat it as a path)
# http://www.python.org is valid

if not re.match(r'https?://', url):
    url = 'http://' + url

# url is now 'http://www.python.org'

parsed = urlsplit(url)

# parsed.scheme is 'http'
# parsed.netloc is 'www.python.org'
# parsed.path is '' (an empty string), since no path was given

host = parsed.netloc  # www.python.org

# Removing www.
# This can be a bad idea, because www.python.org could
# resolve to something different from python.org

if host.startswith('www.'):
    host = host[4:]
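
For a whole log file, the same steps can be wrapped in a small helper. This is only a minimal sketch in Python 3: the name `host_from_line` is made up for illustration, and it assumes each line holds a single http/https URL, with or without the scheme.

```python
import re
from urllib.parse import urlsplit


def host_from_line(line, strip_www=True):
    """Return the host part of a URL-ish string, or None if there is none.

    host_from_line is a hypothetical helper, not part of any library.
    """
    url = line.strip()
    if not url:
        return None
    # urlsplit only fills in the host when a scheme (or leading //) is present.
    if not re.match(r'https?://', url, re.IGNORECASE):
        url = 'http://' + url
    host = urlsplit(url).hostname  # lower-cased, without port or credentials
    if host and strip_www and host.startswith('www.'):
        host = host[4:]
    return host


for sample in ['http://foo.com/index.html', 'www.bar.com/?q=res',
               'https://www.foobar.com:8080/x']:
    print(host_from_line(sample))
```

Passing `strip_www=False` keeps the `www.` label, which is probably the safer default if the two names might point at different servers.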
Markus Unterwaditzer

You can do without regexes here.

with open("file_path", "r") as f:
    lines = f.read()
    lines = lines.replace("https://", "").replace("http://", "")
    lines = lines.replace("www.", "")  # May replace some false positives ('www.com')
    urls = [url.split('/')[0] for url in lines.split()]
    print '\n'.join(urls)

Example file input:

http://foo.com/index.html
http://www.foobar.com
www.bar.com/?q=res
www.foobar.com

Output:

foo.com
foobar.com
bar.com
foobar.com

Edit:

There could be a tricky URL like foobarwww.com, and the approach above would strip the www. from the middle of it. For that case we have to go back to regexes.

Replace the line lines = lines.replace("www.", "") with lines = re.sub(r'www\.(?!com\b)', '', lines) (and add import re). Of course, every possible TLD would have to be listed in the negative lookahead.
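
An alternative that avoids listing TLDs is to anchor the pattern at the start of each URL, so that only a leading prefix is removed. This is a minimal sketch, assuming one URL per string; `strip_prefix` and `PREFIX` are names invented for this example.

```python
import re

# Optional scheme, then optional "www.", but only at the very start of the string.
PREFIX = re.compile(r'^(?:https?://)?(?:www\.)?', re.IGNORECASE)


def strip_prefix(url):
    """Remove a leading http://, https:// and/or www. (hypothetical helper)."""
    return PREFIX.sub('', url.strip())


print(strip_prefix('http://www.foobar.com'))   # foobar.com
print(strip_prefix('www.bar.com/?q=res'))      # bar.com/?q=res
print(strip_prefix('http://foobarwww.com'))    # foobarwww.com
```

Note that a bare www.com would still be reduced to com, so the false positive mentioned in the code comment above is not fully solved by anchoring alone.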

siddharthlatest

I came across the same problem. This is a solution based on regular expressions:

>>> import re
>>> rec = re.compile(r"https?://(www\.)?")

>>> rec.sub('', 'https://domain.com/bla/').strip().strip('/')
'domain.com/bla'

>>> rec.sub('', 'https://domain.com/bla/    ').strip().strip('/')
'domain.com/bla'

>>> rec.sub('', 'http://domain.com/bla/    ').strip().strip('/')
'domain.com/bla'

>>> rec.sub('', 'http://www.domain.com/bla/    ').strip().strip('/')
'domain.com/bla'
thet

Check out the urlparse module (urllib.parse in Python 3), which can do these things for you automatically.

>>> import urlparse
>>> urlparse.urlsplit('http://www.google.com.au/q?test')
SplitResult(scheme='http', netloc='www.google.com.au', path='/q', query='test', fragment='')
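
In Python 3 the same function lives in urllib.parse. The short sketch below also shows the difference between netloc and hostname, which may matter when a log entry contains a port or mixed case; the sample URL is made up for illustration.

```python
from urllib.parse import urlsplit

parts = urlsplit('http://WWW.Google.com.au:8080/q?test')

print(parts.netloc)    # WWW.Google.com.au:8080
print(parts.hostname)  # www.google.com.au
print(parts.port)      # 8080
```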
Alex L
Tom

You can use urlparse. Also, the solution should be generic to remove things other than 'www' before the domain name (i.e., handle cases like server1.domain.com). The following is a quick try that should work:

from urlparse import urlparse

url = 'http://www.muneeb.org/files/alan_turing_thesis.jpg'

o = urlparse(url)

domain = o.hostname

temp = domain.rsplit('.')

if len(temp) == 3:
    domain = temp[1] + '.' + temp[2]

print domain
Muneeb Ali

I believe @Muneeb Ali's answer is the closest to a solution, but it breaks on something like frontdomain.domain.co.uk.

I suppose:

domain = ""
for i in range(1, len(temp) - 1):
    domain = domain + temp[i] + "."
domain = domain + temp[-1]

Is there a nicer way to do this?
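
One shorter way to write that loop is to join the labels after the first one. This still assumes that exactly one leading label should be dropped, which is not true in general: telling domain.co.uk apart from a subdomain needs a list of public suffixes. The sketch below uses a tiny hand-written set, `KNOWN_SUFFIXES`, purely for illustration; a real list (such as the Public Suffix List) is far longer, and `registered_domain` is a made-up helper name.

```python
# A deliberately tiny, hand-written sample; not a complete suffix list.
KNOWN_SUFFIXES = {'com', 'org', 'co.uk', 'com.au'}


def registered_domain(host):
    """Return the suffix plus one label, e.g. 'domain.co.uk' (hypothetical helper)."""
    labels = host.lower().split('.')
    # Try the longest candidate suffix first so 'co.uk' wins over 'uk'.
    for i in range(1, len(labels)):
        if '.'.join(labels[i:]) in KNOWN_SUFFIXES:
            return '.'.join(labels[i - 1:])
    return host  # unknown suffix: leave the host unchanged


temp = 'frontdomain.domain.co.uk'.split('.')
print('.'.join(temp[1:]))                             # domain.co.uk
print(registered_domain('frontdomain.domain.co.uk'))  # domain.co.uk
print(registered_domain('www.muneeb.org'))            # muneeb.org
```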

Claudiu