3

I have a string with no spaces between the words:

"specificationsinaccordancewithqualityaccreditedstandards"

It needs to be split into separate words, like this:

"specifications in accordance with quality accredited standards"

I tried nltk's word_tokenize, but it was not able to split the string.

Context: I am parsing a PDF document into a text file using PDFMiner in Python, and this is the text I get back from the converter.

Ajax1234
Subhajeet Dey
  • Is there another PDF converter you can try? It shouldn't be jamming all the words together like that. – sniperd Sep 05 '17 at 13:08
  • You're most likely going to run into problems with ambiguity. For example: is the first word in that string "specific" (followed by "at" and "ion", both unique, valid words) or "specification"? – Zinki Sep 05 '17 at 13:08
  • Did you try brute-forcing it by searching for every word in the dictionary? I'm pretty sure you can find a library with all the words in the English dictionary. –  Sep 05 '17 at 13:10
  • Yes, this is what led me to the solution, thanks. – Subhajeet Dey Sep 05 '17 at 16:14

2 Answers

4

You can use recursion to solve this problem. First, you will want to download a dictionary txt file, which you can get here: https://github.com/Ajax12345/My-Python-Projects/blob/master/the_file.txt

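# Load the word list, one word per line. Blank lines are skipped: an empty
# string is a prefix of everything and would make the recursion never end.
with open('the_file.txt') as f:
    dictionary = [line.strip() for line in f if line.strip()]

def get_options(scrambled, flag, totals, last):
    # flag is set once no further dictionary word can be matched
    if flag:
        return totals

    # every dictionary word that the remaining text starts with
    new_list = [i for i in dictionary if scrambled.startswith(i)]
    if new_list:
        # take the last match in file order; in an alphabetically sorted
        # word list this is the longest matching prefix
        possible_word = new_list[-1]
        new_totals = totals + [possible_word]
        new_scrambled = scrambled[len(possible_word):]
        return get_options(new_scrambled, False, new_totals, possible_word)

    # nothing matches (or the string is used up): stop and return the words found
    return get_options("", True, totals, '')


s = "specificationsinaccordancewithqualityaccreditedstandards"
print(' '.join(get_options(s, False, [], '')))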

Output:

specifications in accordance with quality accredited standards
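Note that always taking a single prefix can hit a dead end when that choice leaves a remainder that no dictionary word starts (the ambiguity raised in the comments). Here is a minimal sketch of a variant that backtracks to shorter prefixes when that happens. The `segment` function and the small `WORDS` set are hypothetical stand-ins for illustration; in practice the set would be loaded from the dictionary file.

```python
from functools import lru_cache

# Hypothetical stand-in for the downloaded word list.
WORDS = {"specific", "specification", "specifications", "in", "accord",
         "accordance", "with", "quality", "accredited", "standards", "stand"}

def segment(text, words=WORDS):
    """Return one split of text into dictionary words, or None if there is none."""
    max_len = max(map(len, words))

    @lru_cache(maxsize=None)
    def solve(start):
        # Reaching the end of the text means everything before it was split.
        if start == len(text):
            return ()
        # Try the longest candidate first, then fall back to shorter ones.
        for end in range(min(len(text), start + max_len), start, -1):
            if text[start:end] in words:
                rest = solve(end)
                if rest is not None:
                    return (text[start:end],) + rest
        return None  # no word starting here leads to a complete split

    result = solve(0)
    return None if result is None else list(result)

print(segment("specificationsinaccordancewithqualityaccreditedstandards"))
print(segment("abcd", {"abc", "ab", "cd"}))  # "abc" is a dead end, so it backs off to "ab"
```

The cache means each start position is solved only once, so the backtracking stays cheap for strings of this size. For very long inputs the recursion depth could become a limit.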
Ajax1234
  • This is what I was looking for, thanks. The dictionary can also be dynamic and include words we have already found. – Subhajeet Dey Sep 05 '17 at 16:12
3

You could use a trie. A trie is a data structure that lets you validate words.
It is a tree in which you follow a branch for as long as the prefix is valid, and a marker tells you when you have hit a full word.

Although I have never used it in practice, I found this Python implementation, datrie.

My thought would be to import datrie, use it to generate a trie from a txt dictionary (e.g. here), and then parse the string. Read it character by character for as long as you find matches in the trie; when you no longer do, you have reasonably found a word, so add it to the string of split words.
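To make the idea concrete, here is a rough sketch that uses plain nested dicts instead of datrie, since I have not checked datrie's API. The names `build_trie`, `split_with_trie`, and `END`, as well as the small word list, are made up for illustration. It is greedy: at each position it keeps the longest word the trie recognises.

```python
END = None  # hypothetical marker key: "a word ends at this node"

def build_trie(words):
    """Build a trie as nested dicts, one level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def split_with_trie(text, trie):
    """Greedy split: walk the trie from each position, keep the longest word seen."""
    result, pos = [], 0
    while pos < len(text):
        node, longest = trie, 0
        for i in range(pos, len(text)):
            node = node.get(text[i])
            if node is None:          # no dictionary word continues this way
                break
            if END in node:           # a full word ends at character i
                longest = i + 1 - pos
        if longest == 0:
            raise ValueError(f"no dictionary word starts at position {pos}")
        result.append(text[pos:pos + longest])
        pos += longest
    return result

# Hypothetical stand-in for a real dictionary file.
words = ["specific", "specifications", "in", "accord", "accordance",
         "with", "quality", "accredited", "stand", "standards"]
trie = build_trie(words)
print(" ".join(split_with_trie(
    "specificationsinaccordancewithqualityaccreditedstandards", trie)))
```

Compared with scanning the whole word list at every position, the trie only inspects characters that can still extend a valid prefix. Being greedy, it can still pick a word that leaves an unsplittable remainder, in which case this sketch raises an error rather than backtracking.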

You can find more on tries here on Wikipedia or in this video (which is the one that taught me what a trie is).

magicleon94