I have some large text files that were extracted from PDFs. The text is correct English, but many words are joined together: "informationotherwise", "havebeen", "reportthatexplains". Any spell checker (LanguageTool, Sublime, MS Word) flags these errors, but I cannot get Python to correct them.

I tried pyspellchecker and TextBlob to check and correct these words, but, alas, to no avail.

See for example this code, which returns None three times.

from spellchecker import SpellChecker  # installed as pyspellchecker

spell = SpellChecker()
misspelled = spell.unknown(["informationotherwise", "havebeen", "reportthatexplains"])

for word in misspelled:
    print(spell.correction(word))
    print(spell.candidates(word))

And this code:

from textblob import TextBlob

t = "havebeen"
TextBlob(t).correct().string

>>> 'havebeen'

Any suggestions?

Martien Lubberink

1 Answer

Use the wordninja library to split each joined word into its component words:

import wordninja

words = ["informationotherwise", "havebeen", "reportthatexplains"]
for x in words:
    print(' '.join(wordninja.split(x)))

# Output:
# information otherwise
# have been
# report that explains
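
To give a feel for how this kind of splitting can work, here is a minimal sketch of dictionary-based word segmentation using dynamic programming. As far as I know, wordninja does something similar with costs derived from English word frequencies, but this is not its actual code. The `VOCAB` list, the cost formula, and the `split_joined` function are all made up for illustration.

```python
import math

# Hypothetical toy vocabulary, ordered from most to least frequent.
VOCAB = ["that", "have", "been", "report", "information", "otherwise", "explains"]

# Assumed Zipf-style cost: words further down the list are more expensive.
COST = {word: math.log(rank + 1) for rank, word in enumerate(VOCAB, start=1)}
MAX_LEN = max(len(word) for word in VOCAB)


def split_joined(text):
    """Split text into vocabulary words with the lowest total cost."""
    # best[i] holds (total cost, start index of last word) for text[:i].
    best = [(0.0, 0)]
    for i in range(1, len(text) + 1):
        candidates = []
        for j in range(max(0, i - MAX_LEN), i):
            piece_cost = COST.get(text[j:i], math.inf)
            candidates.append((best[j][0] + piece_cost, j))
        best.append(min(candidates))

    # No valid segmentation found: return the input unchanged.
    if math.isinf(best[-1][0]):
        return [text]

    # Walk the stored start indices backwards to recover the words.
    words = []
    i = len(text)
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return list(reversed(words))


for joined in ["informationotherwise", "havebeen", "reportthatexplains"]:
    print(" ".join(split_joined(joined)))
```

With a real word list the vocabulary would be far larger, so in practice it is easier to rely on wordninja itself and use a sketch like this only to understand the idea.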
qaiser