Infer Spaces: Ignore Numbers and Special Characters

Question

I'm trying to pass over everything that isn't a letter (apostrophes, etc), and then continue on afterwards. The number should be in its respective place in the result. This is from this accepted answer, and the word list is here.

The string is "thereare7deadlysins"
The code below outputs "there are 7 d e a d l y s i n s"
I'm trying to get "there are 7 deadly sins"

I tried (below), but I receive IndexError: 'string index out of range'

# Backtrack to recover the minimal-cost string.
out = []
i = len(s)
while i>0:
    if isinstance(s[i], int):
        continue
    c,k = best_match(i)
    assert c == cost[i]
    out.append(s[i-k:i])
    i -= k

The entire thing is:

from math import log 
import string

# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = open("/Users/.../Desktop/wordlist.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)
table = string.maketrans("","")
l = "".join("thereare7deadlysins".split()).lower()

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""
    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k

    return " ".join(reversed(out))

def test_trans(s):
    return s.translate(table, string.punctuation)


s = test_trans(l)
print(infer_spaces(s))

EDIT: Based on the accepted answer the following solved my problem:
1. Remove single letters from the wordlist (except a, e, i)
2. Added the following below wordcost.

nums = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
for n in nums:
    wordcost[n] = log(2)

The suggestion to change wordcost to (below) did not produce optimal results.

wordcost = dict( (k, (i+1)*log(1+len(k))) for i,k in enumerate(words) )

Example:
String: "Recall8importantscreeningquestions"
Original wordcost: "recall 8 important screening questions"
Suggested wordcost: "re call 8 important s c re e n in g question s"

This is just a guess, but shouldn't that be `wordcost = dict((k, log((i+1)*log(len(k)))) for i,k in enumerate(words))` ? — PM 2Ring, Jan 24 '15 at 07:06

score 1 · Accepted Answer · answered Jan 24 '15 at 07:46

Note that the word list contains all 26 individual letters as words.

With just the following modifications, your algorithm will correctly infer the spaces for the input string "therearesevendeadlysins" (i.e. "7" changed to "seven"):

Remove the single letter words from the word list (perhaps except for "a" and "i".)
As @Pm 2Ring noted, change the definition of wordcost

to:

wordcost = wordcost = dict( (k, (i+1)*log(1+len(k))) for i,k in enumerate(words) )

So there is something about non-letters that is goofing up your algorithm. Since you have already removed punctuation, perhaps you should treat a string of non-letters as a single word.

For instance, if you add:

wordcost["7"] = log(2)

(in addition to changes 1 and 2 above) your algorithm works on the original test string.

score 0 · Answer 2 · answered Jan 24 '15 at 07:55

0

i = len(s) -1

to avoid IndexError: 'string index out of range' and

if s[i].isdigit():

is the test you're looking for.

answered Jan 24 '15 at 07:55

TennisVisuals

427
5
8

Infer Spaces: Ignore Numbers and Special Characters

2 Answers2