
My problem is the following. I have a long list of URLs such as:

www.foo.com/davidbobmike1joe
www.foo.com/mikejoe2bobkarl
www.foo.com/joemikebob
www.foo.com/bobjoe

I need to compare all the entries (URLs) in that list with each other, extract the keywords in the paths of those URLs (in this case: david, joe, bob, mike, karl), and order them by frequency. I've been reading about several libraries such as nltk. However, the problem here is that there are no spaces to tokenise each word independently. Any recommendations on how to get the job done?

Stephane Rolland
  • why not use a space, and then use BASE64 to convert them? – Abdelouahab Pp Mar 12 '13 at 21:44
  • Without delimiters or keywords, it's pretty much impossible to separate the words; you can't make a machine arbitrarily guess how you want to split the text :/ you need SOME kind of context. – asermax Mar 12 '13 at 21:48
  • You're right. However, I think this case is different, since I can compare the URLs with one another and determine which sets of characters are repeated most often in order to extract those keywords –  Mar 12 '13 at 21:54
  • You will still have the same problem: how many letters do you use to determine a word? 3? 4? 10? Will you try incrementing the number of letters by one and iterating over all URLs? You will match 'mi', 'mik', 'mike', etc., and they will all be valid matches. And so on. I insist, it will be very hard to get this working without some kind of context. – asermax Mar 12 '13 at 22:06
  • This seems popular today... last related post: http://stackoverflow.com/questions/15364975/is-there-an-easy-way-generate-a-probable-list-of-words-from-an-unspaced-sentence#comment21709704_15364975 – Jon Clements Mar 13 '13 at 00:31

2 Answers


Overview

You could use the code from this related question to extract the names, passing in a list such as ['david', 'bob', ...]:

Is there an easy way generate a probable list of words from an unspaced sentence in python?

Then use collections.Counter to get the frequencies.

The code

from Bio import trie  # Bio.trie is Biopython's trie implementation (Python 2 code throughout)
import string
from collections import Counter


def get_trie(words):
    """Build a trie mapping each known word to its length."""
    tr = trie.trie()
    for word in words:
        tr[word] = len(word)
    return tr

def get_trie_word(tr, s):
    """Greedily match the longest known word at the start of s.

    Returns (word, rest_of_s), or (None, s) if no known word matches.
    """
    for end in reversed(range(len(s))):
        word = s[:end + 1]
        if tr.has_key(word):
            return word, s[end + 1:]
    return None, s

def get_trie_words(s):
    names = ['david', 'bob', 'karl', 'joe', 'mike']
    tr = get_trie(names)
    while s:
        word, s = get_trie_word(tr, s)
        if word is None:
            # No known name starts here; skip one character and retry.
            # Without this, the loop never terminated on unmatched input.
            s = s[1:]
            continue
        yield word

def main(urls):
    d = Counter()
    for url in urls:
        # Keep only lowercase letters (drops digits like the '1' and '2').
        url = "".join(a for a in url if a in string.lowercase)
        for word in get_trie_words(url):
            d[word] += 1
    return d

if __name__ == '__main__':
    urls = [
        "davidbobmike1joe",
        "mikejoe2bobkarl",
        "joemikebob",
        "bobjoe",
    ]
    print main(urls)

Results

Counter({'bob': 4, 'joe': 4, 'mike': 3, 'karl': 1, 'david': 1})
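
Since the question asks for the keywords ordered by frequency, note that Counter.most_common() returns the (word, count) pairs sorted from most to least common (ties, such as 'bob'/'joe' here, may come out in either order):

>>> main(urls).most_common()
[('bob', 4), ('joe', 4), ('mike', 3), ('karl', 1), ('david', 1)]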
hughdbrown
  • Hi, the problem is there are tons of URLs and it would be impossible to create an a priori keyword list to match the terms. –  Mar 12 '13 at 21:43
  • Oh, and it would be better if no dictionary were used –  Mar 12 '13 at 21:45
  • Then I don't get it: any library you use is going to have some way of figuring out the keywords, and a dictionary of words seems most promising. If a dictionary is ruled out (as in the approach I used here: http://stackoverflow.com/questions/15364975/is-there-an-easy-way-generate-a-probable-list-of-words-from-an-unspaced-sentence/15367466#15367466), how do you imagine the keywords will be generated? – hughdbrown Mar 12 '13 at 21:55
  • You could use the Facebook name list; it is available as a torrent since it is big. – Abdelouahab Pp Mar 12 '13 at 22:02

Limitations

If you refuse to use a dictionary, your algorithm will require a lot of computation. Beyond that, it is impossible to distinguish a keyword that occurs only once (e.g. "karl") from a crappy sequence (e.g. "e2bo"). My solution is a best effort and will only work if your list of URLs contains each keyword multiple times.

The basic idea

I assume a word is a frequently occurring sequence of at least 3 characters. This prevents the letter "o" from being the most popular word.

The basic idea is the following:

  • Count all n-letter sequences and select the ones that occur multiple times.
  • Cut all sequences that are part of a larger sequence with the same count.
  • Order them by popularity and you have a solution that comes close to solving your problem. (Left as an exercise to the reader.)

In code

import operator

sentences = ["davidbobmike1joe", "mikejoe2bobkarl", "joemikebob", "bobjoe", "bobbyisawesome", "david", "bobbyjoe"];
dict = {}

def countWords(n):
    """Count all possible character sequences/words of length n occurring in all given sentences"""
    for sentence in sentences:
        countWordsSentence(sentence, n);

def countWordsSentence(sentence, n):
    """Count all possible character sequences/words of length n occurring in a sentence"""
    for i in range(0, len(sentence) - n + 1):
        word = sentence[i:i + n]
        if word not in dict:
            dict[word] = 1;
        else:
            dict[word] = dict[word] + 1;

def cropDictionary():
    """Removes all words that occur only once."""
    for key in dict.keys():
        if dict[key] == 1:
            dict.pop(key);

def removePartials(word):
    """Removes all the partial occurrences of a given word from the dictionary."""
    if word not in dict:
        # Already removed as a partial of a longer word; nothing left to do.
        return
    for i in range(3, len(word)):
        for j in range(0, len(word) - i + 1):
            for key in dict.keys():
                if key == word[j:j + i] and dict[key] == dict[word]:
                    dict.pop(key);

def removeAllPartials():
    """Removes all partial words in the dictionary"""
    for word in dict.keys():
        removePartials(word);

# Grow the window one character at a time, pruning after every pass;
# the + 1 lets the window reach the length of the longest sentence.
for i in range(3, max(len(s) for s in sentences) + 1):
    countWords(i);
    cropDictionary();
    removeAllPartials();

print dict;

Output

>>> print dict;
{'mike': 3, 'bobby': 2, 'david': 2, 'joe': 5, 'bob': 6}

Some challenges to the reader

  • Sort the dictionary by value before printing it; one possible approach is sketched after this list. (Sort a Python dictionary by value)
  • In this example "bob" occurs six times, two of which are as part of "bobby". Decide whether this is problematic and fix it if necessary.
  • Take capitalization into account.
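
For the first challenge, here is one possible sketch; it uses the operator module that the code above imports but never uses (this is an assumption about what the author intended, not code from the answer):

>>> sorted(dict.items(), key=operator.itemgetter(1), reverse=True)
[('bob', 6), ('joe', 5), ('mike', 3), ('bobby', 2), ('david', 2)]

(The tied entries 'bobby' and 'david' may come out in either order.)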
Erik
  • Suggestions: (1) don't name a variable `dict` (2) use `collections.defaultdict(int)` instead of `dict()` (3) death to semicolons. I'll try the code and +1 it later. Someone has to. – hughdbrown Mar 13 '13 at 00:24 (a sketch applying these suggestions follows below)
  • This is exactly what I was looking for. Thank you very much. I'll try the code an give an update –  Mar 13 '13 at 07:04
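
For illustration, the counting step with hughdbrown's three suggestions applied might look like the sketch below. This is a hypothetical rewrite, not code from the thread; `counts` is a name introduced here to replace `dict`:

from collections import defaultdict

counts = defaultdict(int)  # (2) defaultdict(int); (1) not shadowing the dict builtin

def countWordsSentence(sentence, n):
    """Count all character sequences of length n occurring in a sentence."""
    for i in range(0, len(sentence) - n + 1):
        counts[sentence[i:i + n]] += 1  # (3) no semicolons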