How to convert line of text into meaningful words

Question

I have a string with no spaces between the words:

"specificationsinaccordancewithqualityaccreditedstandards"

It needs to be split into separate words, such as:

"specifications in accordance with quality accredited standards"

I have tried nltk's word_tokenize, but it was not able to split the string.

Context: I am parsing a PDF document into a text file with PDFMiner in Python, and this is the text I get back from the converter.

Is there another PDF converter you can try? It shouldn't be jamming all the words together like that. — sniperd, Sep 05 '17 at 13:08
You're most likely going to run into problems with ambiguity. For example: is the first word in that string "specific" (followed by "at" and "ion", both unique, valid words) or "specification"? — Zinki, Sep 05 '17 at 13:08
Did you try brute-forcing it by searching for every word in the dictionary? Pretty sure you can find a library with all the words in the English dictionary. — , Sep 05 '17 at 13:10

score 4 · Accepted Answer · answered Sep 05 '17 at 14:33

You can use recursion to solve this problem: repeatedly strip a dictionary word from the front of the string and recurse on the remainder. First, download a dictionary text file, such as this one: https://github.com/Ajax12345/My-Python-Projects/blob/master/the_file.txt

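# Load the word list once; assumes the_file.txt has one word per line.
with open('the_file.txt') as f:
    dictionary = {line.strip() for line in f if line.strip()}

def get_options(scrambled, totals=None):
    """Recursively peel the longest matching dictionary word off the front."""
    if totals is None:
        totals = []
    # Every dictionary word that the remaining text starts with.
    new_list = [w for w in dictionary if scrambled.startswith(w)]
    if not new_list:
        # Nothing matches (or the text is used up): return what was found.
        return totals
    # Take the longest match explicitly instead of relying on file order
    # (in an alphabetically sorted file, the last match is also the longest).
    possible_word = max(new_list, key=len)
    totals.append(possible_word)
    return get_options(scrambled[len(possible_word):], totals)


s = "specificationsinaccordancewithqualityaccreditedstandards"
print(' '.join(get_options(s)))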

Output:

specifications in accordance with quality accredited standards
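
The function above commits to one word at each step and never revisits that choice, so it can stop early when the chosen word leaves a remainder that cannot be split. A possible variant that backtracks to shorter matches is sketched below. The `segment` function and the tiny `WORDS` set are hypothetical names used only for illustration; in practice the set would be loaded from the dictionary file.

```python
from functools import lru_cache

# Hypothetical mini word list for illustration only; a real run would
# load the words from a dictionary file instead.
WORDS = frozenset({
    "specific", "specification", "specifications", "in", "accord",
    "accordance", "with", "quality", "accredited", "standards",
    "pen", "pens", "sand",
})


def segment(text, words=WORDS):
    """Return one split of text into dictionary words, or None if none exists."""
    max_len = max(map(len, words))

    @lru_cache(maxsize=None)
    def solve(start):
        # Reached the end of the text: the empty tail is trivially split.
        if start == len(text):
            return ()
        # Try the longest candidate first, then fall back to shorter ones.
        for end in range(min(len(text), start + max_len), start, -1):
            piece = text[start:end]
            if piece in words:
                rest = solve(end)
                if rest is not None:
                    return (piece,) + rest
        # No word starting here leads to a complete split.
        return None

    result = solve(0)
    return None if result is None else list(result)


s = "specificationsinaccordancewithqualityaccreditedstandards"
print(segment(s))
print(segment("pensand"))
```

With this word list, "pensand" shows the difference: taking the longest match "pens" leaves "and", which is not in the set, so the search falls back to "pen" followed by "sand". Memoizing on the start index keeps the search from re-solving the same suffix.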

This is what I was looking for, thanks. The dictionary can also be dynamic and contain words we have already found. — Subhajeet Dey, Sep 05 '17 at 16:12

score 3 · Answer 2 · answered Sep 05 '17 at 13:19

You could use a trie. A trie is a data structure that supports word validation.
It is a tree in which you follow a branch for each valid prefix, and a node is flagged whenever the path to it spells a full word.

I have never used one in practice, but I found this Python implementation, datrie.

My thought would be to import datrie, use it to build a trie from a txt dictionary (e.g. here), and then parse the string. Read it character by character for as long as the trie still has a match; when it no longer does, you have reasonably found a word, so add it to the list of split words and start again from that position.

You can find more on tries here on Wikipedia or in this video (which is what taught me what a trie is).
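
As a rough sketch of that idea, the example below builds the trie from plain nested dictionaries instead of datrie, so it makes no assumptions about that library's API. The names `build_trie`, `longest_prefix`, `split_words`, and the short word list are all hypothetical and chosen for illustration.

```python
END = None  # sentinel key marking "a complete word ends at this node"


def build_trie(words):
    """Build a nested-dict trie; each character maps to a child node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def longest_prefix(trie, text, start):
    """Length of the longest dictionary word beginning at text[start], or 0."""
    node = trie
    best = 0
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break  # no dictionary word continues with this character
        if END in node:
            best = i - start + 1  # remember the longest full word so far
    return best


def split_words(text, trie):
    """Greedily split text by taking the longest trie match at each position."""
    out = []
    pos = 0
    while pos < len(text):
        n = longest_prefix(trie, text, pos)
        if n == 0:
            raise ValueError(f"no dictionary word starts at position {pos}")
        out.append(text[pos:pos + n])
        pos += n
    return out


# Hypothetical mini word list; a real run would read a dictionary file.
trie = build_trie([
    "specific", "specification", "specifications", "in", "accord",
    "accordance", "with", "quality", "accredited", "standards",
])
s = "specificationsinaccordancewithqualityaccreditedstandards"
print(" ".join(split_words(s, trie)))
```

Each position is scanned once along a single branch of the trie, so finding the longest match does not require comparing the text against every word in the dictionary. Like any greedy split, it can still pick a word that leaves an unsplittable remainder.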
