I'm trying to skip over everything that isn't a letter (apostrophes, etc.) and then continue splitting afterwards. Any number should stay in its respective place in the result. The code is from this accepted answer, and the word list is here.
The string is "thereare7deadlysins"
The code below outputs "there are 7 d e a d l y s i n s"
I'm trying to get "there are 7 deadly sins"
I tried the following change to the backtracking loop, but I receive IndexError: 'string index out of range'
    # Backtrack to recover the minimal-cost string.
    out = []
    i = len(s)
    while i>0:
        if isinstance(s[i], int):
            continue
        c,k = best_match(i)
        assert c == cost[i]
        out.append(s[i-k:i])
        i -= k
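For reference, here is a minimal sketch of what the attempt above runs into, written for Python 3 as an assumption. It only illustrates three facts about the snippet: `s[len(s)]` is one past the last valid index, a character taken from a string is never an `int`, and `str.isdigit()` is one way to detect a digit character. The variable names are just for illustration.

```python
s = "thereare7deadlysins"
i = len(s)

# Valid indices are 0 .. len(s) - 1, so s[len(s)] raises IndexError.
try:
    s[i]
    raised = False
except IndexError:
    raised = True

# Indexing a string yields a one-character str, never an int,
# so an isinstance(..., int) check can never be True here.
last_is_int = isinstance(s[i - 1], int)

# str.isdigit() is one way to find digit characters instead.
digit_positions = [j for j, ch in enumerate(s) if ch.isdigit()]

print(raised, last_is_int, digit_positions)
```

Note also that a bare `continue` inside `while i>0:` skips the `i -= k` line, so the loop would never advance if the condition were ever true.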
The entire thing is:
    from math import log
    import string

    # Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
    words = open("/Users/.../Desktop/wordlist.txt").read().split()
    wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
    maxword = max(len(x) for x in words)
    table = string.maketrans("","")
    l = "".join("thereare7deadlysins".split()).lower()

    def infer_spaces(s):
        """Uses dynamic programming to infer the location of spaces in a string
        without spaces."""

        # Find the best match for the i first characters, assuming cost has
        # been built for the i-1 first characters.
        # Returns a pair (match_cost, match_length).
        def best_match(i):
            candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
            return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

        # Build the cost array.
        cost = [0]
        for i in range(1,len(s)+1):
            c,k = best_match(i)
            cost.append(c)

        # Backtrack to recover the minimal-cost string.
        out = []
        i = len(s)
        while i>0:
            c,k = best_match(i)
            assert c == cost[i]
            out.append(s[i-k:i])
            i -= k

        return " ".join(reversed(out))

    def test_trans(s):
        return s.translate(table, string.punctuation)

    s = test_trans(l)
    print(infer_spaces(s))
EDIT: Based on the accepted answer, the following solved my problem:
1. Removed single letters from the word list (except a, e, i).
2. Added the following below wordcost:
    nums = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
    for n in nums:
        wordcost[n] = log(2)
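To show the effect of the two changes above in isolation, here is a self-contained sketch written for Python 3. It is not the original script: the nine-word list is a hypothetical stand-in for wordlist.txt (assumed to be ordered most frequent first and to contain no single letters), and `float("inf")` replaces `9e999`.

```python
from math import log

# Hypothetical miniature word list, most frequent first (stand-in for wordlist.txt).
words = ["the", "are", "there", "sins", "deadly", "dead", "in", "sin", "here"]
wordcost = {w: log((i + 1) * log(len(words))) for i, w in enumerate(words)}

# Give each digit a small fixed cost so it is accepted as its own token.
for d in "0123456789":
    wordcost[d] = log(2)

maxword = max(len(w) for w in wordcost)

def infer_spaces(s):
    """Infer spaces with the same dynamic programme as the question's code."""
    # Best (cost, length) for the word ending at position i.
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i - maxword):i]))
        return min((c + wordcost.get(s[i - k - 1:i], float("inf")), k + 1)
                   for k, c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1, len(s) + 1):
        c, k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost segmentation.
    out = []
    i = len(s)
    while i > 0:
        c, k = best_match(i)
        out.append(s[i - k:i])
        i -= k
    return " ".join(reversed(out))

print(infer_spaces("thereare7deadlysins"))  # there are 7 deadly sins
```

With this toy list, "7" is kept as a separate token only because it has a finite cost in wordcost; without the digit entries no segmentation of the string has a finite cost.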
The suggestion to change wordcost to the following did not produce optimal results:
    wordcost = dict( (k, (i+1)*log(1+len(k))) for i,k in enumerate(words) )
Example:
String: "Recall8importantscreeningquestions"
Original wordcost: "recall 8 important screening questions"
Suggested wordcost: "re call 8 important s c re e n in g question s"
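One possible reading of that difference, offered as a guess rather than a confirmed diagnosis: the original cost grows with the logarithm of a word's rank, while the suggested cost grows linearly with rank, so a long but less frequent word can become far more expensive than a run of short, frequent fragments. The sketch below just evaluates both formulas; the rank (5000) and list size (100000) are made-up numbers, not taken from the actual word list.

```python
from math import log

def original_cost(rank, n_words):
    # Cost from the original script: log((i+1) * log(len(words))).
    return log((rank + 1) * log(n_words))

def suggested_cost(rank, word):
    # Cost from the suggested change: (i+1) * log(1 + len(k)).
    return (rank + 1) * log(1 + len(word))

# Hypothetical values for illustration only.
rank, n_words = 5000, 100000
print(original_cost(rank, n_words))
print(suggested_cost(rank, "screening"))
```

Under these assumed numbers the suggested cost is several orders of magnitude larger than the original one for the same word.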