4

I have a large list of domain names (around six thousand), and I would like to see which words trend the highest for a rough overview of our portfolio.

The problem I have is the list is formatted as domain names, for example:

examplecartrading.com

examplepensions.co.uk

exampledeals.org

examplesummeroffers.com

+5996

Just running a word count brings up garbage. So I guess the simplest way to go about this would be to insert spaces between the whole words and then run a word count.

For my sanity I would prefer to script this.

I know (very) little Python 2.7, but I am open to any recommendations on approaching this; an example of code would really help. I have been told that using a simple string trie data structure would be the simplest way of achieving this, but I have no idea how to implement one in Python.

Christopher Long
    How do you decide where the word boundaries are? Do you have a dictionary of words you're expecting? What if two different splits are possible? – Tim Pietzcker Aug 01 '11 at 10:42
  • The portfolio is comprised of 100% English words as far as I can tell, so I guessed I would need to run it against a full English dictionary? I'm fine with a certain amount of false results as this can be reviewed afterwards. – Christopher Long Aug 01 '11 at 10:46
  • [Possible duplicate](http://stackoverflow.com/questions/1538589/how-to-sort-all-possible-words-out-of-a-string) – Lauritz V. Thaulow Aug 01 '11 at 11:04

3 Answers

6

We try to split the domain name (`s`) into any number of words (not just 2) from a set of known words (`words`). Recursion ftw!

def substrings_in_set(s, words):
    # Yield every way of writing s as a concatenation of words from the set.
    if s in words:
        yield [s]
    for i in range(1, len(s)):
        # Only split on prefixes that are themselves known words.
        if s[:i] not in words:
            continue
        for rest in substrings_in_set(s[i:], words):
            yield [s[:i]] + rest

This generator function first yields the string it is called with, if it is in `words`. Then it splits the string in two in every possible way. If the first part is not in `words`, it tries the next split. If it is, the first part is prepended to all the results of calling itself on the second part (which may be none, as in ["example", "cart", ...]).
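For instance, with a toy word set (a hypothetical stand-in for the full dictionary built below), every possible split is generated:

toy_words = set(["example", "ex", "ample", "car", "cart", "trading"])
print list(substrings_in_set("examplecartrading", toy_words))
# [['ex', 'ample', 'car', 'trading'], ['example', 'car', 'trading']]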

Then we build the English dictionary:

# Assuming Linux. The word list may also be at /usr/dict/words.
# If not on Linux, grab yourself an English word list and insert its path here:
words = set(x.strip().lower() for x in open("/usr/share/dict/words").readlines())

# The above English dictionary for some reason lists all single letters as words.
# Remove all except "i" and "u" (remember a string is an iterable, which means
# that set("abc") == set(["a", "b", "c"])).
words -= set("bcdefghjklmnopqrstvwxyz")

# If there are more words we don't like, we remove them like this:
words -= set(("ex", "rs", "ra", "frobnicate"))

# We may also add words that we do want to recognize. Now the domain name
# slartibartfast4ever.co.uk will be properly counted, for instance.
words |= set(("4", "2", "slartibartfast")) 

Now we can put things together:

count = {}
no_match = []
domains = ["examplecartrading.com", "examplepensions.co.uk", 
    "exampledeals.org", "examplesummeroffers.com"]

# Assume domains is the list of domain names ["examplecartrading.com", ...]
for domain in domains:
    # Extract the part in front of the first ".", and make it lower case
    name = domain.partition(".")[0].lower()
    found = set()
    for split in substrings_in_set(name, words):
        found |= set(split)
    for word in found:
        count[word] = count.get(word, 0) + 1
    if not found:
        no_match.append(name)

print count
print "No match found for:", no_match

Result: {'ions': 1, 'pens': 1, 'summer': 1, 'car': 1, 'pensions': 1, 'deals': 1, 'offers': 1, 'trading': 1, 'example': 4}

Using a set to contain the English dictionary makes for fast membership checks. `-=` removes items from the set, `|=` adds to it.

Because `substrings_in_set` is a generator, the work is done lazily: a prefix that is not in `words` is skipped immediately, and splits are only produced as they are consumed.

Some substrings may be valid words both as a whole and when split, such as "example" / "ex" + "ample". In some cases we can solve the problem by excluding unwanted words, such as "ex" in the above code example. In others, like "pensions" / "pens" + "ions", it may be unavoidable, and when this happens we need to prevent all the other words in the string from being counted multiple times (once for "pensions" and once for "pens" + "ions"). We do this by keeping track of the found words of each domain name in a set -- sets ignore duplicates -- and counting the words only once all have been found.
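As a quick illustration (the exact output depends on your word list), both splits of "examplepensions" feed the same set, so each word ends up counted at most once per domain:

found = set()
for split in substrings_in_set("examplepensions", words):
    found |= set(split)
print found
# e.g. set(['ions', 'example', 'pens', 'pensions'])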

EDIT: Restructured and added lots of comments. Forced strings to lower case to avoid misses because of capitalization. Also added a list to keep track of domain names where no combination of words matched.

NECROMANCY EDIT: Changed substring function so that it scales better. The old version got ridiculously slow for domain names longer than 16 characters or so. Using just the four domain names above, I've improved my own running time from 3.6 seconds to 0.2 seconds!
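For the curious, another way to attack the scaling problem -- a sketch, not the change actually made above -- is to memoize on the remaining suffix, so each suffix is split at most once no matter how many prefixes lead to it:

def substrings_in_set_cached(s, words, _cache=None):
    # Same splits as substrings_in_set, but returned as a list, with the
    # splits of every suffix cached and reused.
    if _cache is None:
        _cache = {}
    if s in _cache:
        return _cache[s]
    results = []
    if s in words:
        results.append([s])
    for i in range(1, len(s)):
        if s[:i] in words:
            for rest in substrings_in_set_cached(s[i:], words, _cache):
                results.append([s[:i]] + rest)
    _cache[s] = results
    return results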

Lauritz V. Thaulow
  • Thanks for your reply, I have to admit a lot of this is over my head but I'll play about with it and see if I can get it working. – Christopher Long Aug 01 '11 at 11:57
  • Probably a stupid question, wouldn't I need to define domains? – Christopher Long Aug 01 '11 at 13:14
  • @Christopher: Yes, I've updated my answer with a clarification on that point. – Lauritz V. Thaulow Aug 01 '11 at 13:23
  • So this is taking the domain strings from a list I need to add to the code, rather than importing it from a txt file? Sorry if my questions are tedious, I really appreciate the time and help you have given me. – Christopher Long Aug 01 '11 at 13:38
  • @Christopher: Yes, for example: `domains = [x.strip() for x in open("domains.txt").readlines()]`. `strip()` removes extra whitespace (spaces, tabs and newlines) from the beginning and end of each line. It's just to be on the safe side, `domains = open("domains.txt").readlines()` may suffice. – Lauritz V. Thaulow Aug 01 '11 at 13:44
  • @Christopher I've just improved this answer, in case you're interested. – Lauritz V. Thaulow Mar 25 '12 at 12:14
1

with open('/usr/share/dict/words') as f:
  words = set(w.strip() for w in f)  # a set makes the lookups below fast

def guess_split(word):
  # Return the last found split of word into two dictionary words,
  # or [] if no such split exists.
  result = []
  for n in xrange(len(word)):
    if word[:n] in words and word[n:] in words:
      result = [word[:n], word[n:]]
  return result


from collections import defaultdict
word_counts = defaultdict(int)
with open('blah.txt') as f:
  for line in f:
    for word in line.strip().split('.'):
      if len(word) > 3:
        # junks the com, org, etc. parts
        for x in guess_split(word):
          word_counts[x] += 1

for word, count in word_counts.items():
  print '{word}: {count}'.format(word=word, count=count)

Here's a brute force method which only tries to split the domains into 2 English words. If the domain doesn't split into 2 English words, it gets junked. It should be straightforward to extend this to attempt more splits, but it will probably not scale well with the number of splits unless you are clever. Fortunately I guess you'll only need 3 or 4 splits max; one possible recursive extension is sketched after the output below.

output:

deals: 1
example: 2
pensions: 1
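One way to lift the two-word limit -- a sketch reusing the words set above, not tested beyond toy input -- is to recurse on the remainder and stop at the first complete split:

def guess_split_multi(word, max_parts=3):
  # Return the first found split of word into at most max_parts
  # dictionary words, or [] if there is none.
  if word in words:
    return [word]
  if max_parts > 1:
    for n in xrange(1, len(word)):
      if word[:n] in words:
        rest = guess_split_multi(word[n:], max_parts - 1)
        if rest:
          return [word[:n]] + rest
  return []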
wim
  • Thanks for your reply, however lazyr is correct, I'm after a count of the individual words that make up the domain. – Christopher Long Aug 01 '11 at 10:52
  • Ok sorry, that's a much harder problem because of the ambiguity. I'll modify my code appropriately. – wim Aug 01 '11 at 11:47
1

Assuming you only have a few thousand standard domains, you should be able to do this all in memory.

from collections import Counter

def all_sub_strings(s):
    # Yield every contiguous substring of s.
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            yield s[i:j]

# domainfile and dictionaryfile are the paths to your domain list and
# English word list respectively.
dictionary = set(w.strip() for w in open(dictionaryfile))
found = []
for domain in open(domainfile):
    name = domain.strip().partition(".")[0]  # drop the newline and the TLD
    for substring in all_sub_strings(name):
        if substring in dictionary:
            found.append(substring)
c = Counter(found)  # this is what you want

print c
Rusty Rob