
Assume I have the following dictionary mapping domain names to their human-readable descriptions:

domain_info = {"google.com" : "A Search Engine", 
               "facebook.com" : "A Social Networking Site", 
               "stackoverflow.com" : "Q&A Site for Programmers"}

I would like to get the description for response.url, which returns an absolute URL such as http://www.google.com/reader/view/.

My current approach:

url = urlparse.urlparse(response.url)
domain = url.netloc        # 'www.google.com'
domain = domain.split(".") # ['www', 'google', 'com']
domain = domain[-2:]       # ['google', 'com']
domain = ".".join(domain)  # 'google.com'
info = domain_info[domain]

This seems to be too slow for a large number of invocations. Can anyone suggest an alternative way to speed things up?

An ideal solution would handle any subdomain and be case-insensitive.

Penang
  • Depending on what types of URLs you think you might get, I would say either string operations or regular expressions. If you can elaborate on what type of URLs you might get and/or the larger project, I might be able to help more. – Jordan Feb 27 '11 at 01:33
  • DNS names are interpreted in a case-insensitive way (see [this Wikipedia page](http://en.wikipedia.org/wiki/Domain_Name_System#Domain_name_syntax)), so the case makes no difference in any case :). (get it? the CASE makes no difference in any CASE :) ) – Abbafei Feb 27 '11 at 01:37
  • Related question: [http://stackoverflow.com/questions/569137/how-to-get-domain-name-from-url](http://stackoverflow.com/questions/569137/how-to-get-domain-name-from-url). Depending on the type of TLDs you're expecting to get, a regex should extract just the part you care about. – Bluu Feb 27 '11 at 01:40

4 Answers

2

What does "too slow for a large number of invocations" mean? It still runs in constant time (for each URL), and you can't get any better than that. The above seems to be a perfectly good way to do it.

If you need it to be a bit faster (though it won't be dramatically faster), you could write your own regex, something like "[a-zA-Z]+://([a-zA-Z0-9.]+)". That would capture the full host name (subdomain included), so you would still need to do the domain splitting unless you can use lookahead in the regex to grab just the last two segments. Be sure to use re.compile so the pattern is only compiled once.
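
A minimal sketch of that suggestion, precompiling the pattern above with re.compile (the host_of helper name is made up for illustration):

import re

# Compile the pattern once at module level so repeated calls stay cheap
HOST_RE = re.compile(r'[a-zA-Z]+://([a-zA-Z0-9.]+)')

def host_of(url):
    # Return the host part of the URL, or None if the pattern doesn't match
    m = HOST_RE.match(url)
    return m.group(1) if m else None

# host_of('http://www.google.com/reader/view/') -> 'www.google.com'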

Note that taking domain[-2:] is likely not going to be what you want. The logic of finding an appropriate "company level domain" is pretty complicated. For example, if the domain is google.com.au, this will give you "com.au", which is unlikely to be what you want; you probably want "google.com.au".

Since you say an ideal solution would handle any subdomain, you probably want to iterate over all the suffixes:

url = urlparse.urlparse(response.url)
domain = url.netloc        # 'www.google.com'
domain = domain.split(".") # ['www', 'google', 'com']
info = None
for i in range(len(domain)):
    subdomain = ".".join(domain[i:]) # 'www.google.com', 'google.com', 'com'
    try:
        info = domain_info[subdomain]
        break
    except KeyError:
        pass

With the above code, you will find a match at any subdomain level. As for case sensitivity, that is easy: ensure all the keys in the dictionary are lowercase, and apply .lower() to the domain before any other processing.
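
A short sketch of that case handling, assuming domain_info and response.url as in the question (the key-normalization line is an addition, not part of the original code):

import urlparse

# domain_info as defined in the question; lowercase the keys once, up front
domain_info = dict((k.lower(), v) for k, v in domain_info.items())

# ...then lowercase the netloc before the splitting/iteration shown above
domain = urlparse.urlparse(response.url).netloc.lower()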

mgiuca
  • "too slow for large number of operations" -> when invoked multiple times (100+) there is a noticeable impact performance. Sorry, I could have done a better job at explaining that – Penang Feb 27 '11 at 01:53
  • @Penang, how noticeable? Don't know if this helps, but I tested a modified version of the above, which parsed 10000 randomly generated URL strings in less than a second on an Atom processor. – senderle Feb 27 '11 at 02:46
  • @senderle It's part of a spider that uses the Scrapy framework. When I simply set all the descriptions to "" and comment out the above lines, it finds items right away, as opposed to a 10-12 second lag. Any recommendations on how I could profile this program (get empirical data) within Eclipse/PyDev are always welcome. This is still unfamiliar territory to me. – Penang Feb 27 '11 at 03:07
  • Rather than disabling all of it, try just disabling the use of urlparse and hard-coding domain to "www.google.com", and see what the speed is. This might give you a clue as to how slow urlparse is being, which is the only part of the algorithm you might be able to get rid of. – mgiuca Feb 27 '11 at 05:10
1

It seems like urlparse.py in the Python 2.6 standard library does a fair amount of work when the urlparse() function is called. It may be possible to speed things up by writing a small URL parser that does only what is absolutely necessary and no more.

UPDATE: see [this part of Wikipedia's page about DNS](http://en.wikipedia.org/wiki/Domain_Name_System#Domain_name_syntax) for information on the syntax of domain names; it may give some ideas for the parser.
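
For illustration, a stripped-down host extractor along those lines might look something like this (get_host is a made-up name; it ignores ports and credentials in the URL):

def get_host(url):
    # Drop the scheme, if there is one
    pos = url.find("://")
    if pos != -1:
        url = url[pos + 3:]
    # Keep everything up to the first '/', '?' or '#'
    for sep in "/?#":
        cut = url.find(sep)
        if cut != -1:
            url = url[:cut]
    # DNS names are case-insensitive, so normalize to lowercase
    return url.lower()

# get_host('http://www.Google.com/reader/view/') -> 'www.google.com'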

Abbafei
1

You may consider extracting the domain without subdomains using a regular expression:

'http:\/\/([^\.]+\.)*([^\.][a-zA-Z0-9\-]+\.[a-zA-Z]{2,6})(\/?|\/.*)'

import re
m = re.search('http:\/\/([^\.]+\.)*([^\.][a-zA-Z0-9\-]+\.[a-zA-Z]{2,6})(\/?|\/.*)', 'http://www.google.com/asd?#a')
print m.group(2)  # 'google.com'
Viet
  • Forgive my ignorance -- I'm not so good with regular expressions -- but it seems like you don't need to escape forward slashes, do you? – senderle Feb 27 '11 at 02:14
1

You can use some of the work that urlparse does. Try to look things up directly by the netloc it returns and only fall back on the split/join if you must:

def normalize( domain ):
    domain = domain.split(".") # ['www', 'google', 'com']
    domain = domain[-2:]       # ['google', 'com']
    return ".".join(domain)  # 'google.com'


# caches the netlocs that are not "normal"
aliases = {}

def getinfo( url ):
    netloc = urlparse.urlparse(url).netloc

    if netloc in aliases:
        return domain_info[aliases[netloc]]

    if netloc in domain_info:
        return domain_info[netloc]

    main = normalize(netloc)
    if main in domain_info:
        aliases[netloc] = main
        return domain_info[main]

    return None  # completely unknown domain

Same thing with a caching lib:

from beaker.cache import CacheManager
netlocs = CacheManager(namespace='netloc')

@netlocs.cache()
def getloc( domain ):
    try:
        return domain_info[domain]
    except KeyError:
        domain = domain.split(".")
        domain = domain[-2:]
        domain = ".".join(domain)
        return domain_info[domain]

def getinfo( url ):
    netloc = urlparse.urlparse(url).netloc
    return getloc( netloc )
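
A hypothetical call site for either variant, assuming response is the Scrapy response mentioned in the question's comments (note that the first variant returns None for an unknown domain, while this cached one raises KeyError):

# `response` and `domain_info` are assumed from the question
try:
    info = getinfo(response.url)   # e.g. 'A Search Engine' for http://www.google.com/...
except KeyError:
    info = None                    # domain not in domain_info at all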

Maybe it helps a bit, but it really depends on the variety of URLs you have.

Jochen Ritzel