Python: how to automatically spellcheck and correct joined words such as "reportthatexplains" and "havebeen"

Question

I have some large text files that are in correct English because they were extracted from PDFs. However, many words in these files are joined, for example "informationotherwise", "havebeen", and "reportthatexplains". Spell checkers such as LanguageTool, Sublime, and MS Word all flag these errors, but the Python libraries I tried do not correct them.

I tried pyspellchecker and TextBlob to check and correct these words, but, alas, to no avail.

See for example this code, which returns None three times.

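from spellchecker import SpellChecker  # provided by the pyspellchecker package

spell = SpellChecker()
misspelled = spell.unknown(["informationotherwise", "havebeen", "reportthatexplains"])

for word in misspelled:
    print(spell.correction(word))
    print(spell.candidates(word))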

And this code:

from textblob import TextBlob

t = "havebeen"
TextBlob(t).correct().string

>>> 'havebeen'

Any suggestions?

This might help https://stackoverflow.com/q/13928155/5666087 — jkr, Sep 09 '22 at 00:46
Nope `print(spell('havebeen'))` does not alter the word. Neither does Jamspell — Martien Lubberink, Sep 09 '22 at 00:55

score 2 · Accepted Answer · answered Sep 09 '22 at 05:06


Use the wordninja library to split each joined word into its component words:

import wordninja

words = ["informationotherwise", "havebeen", "reportthatexplains"]
for word in words:
    # wordninja.split returns a list of the component words
    print(' '.join(wordninja.split(word)))

# Output:
# information otherwise
# have been
# report that explains
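To show roughly what a word splitter does, and how it could be applied to whole files without touching words that are already correct, here is a minimal, self-contained sketch. It is not wordninja's implementation: it assumes a tiny hypothetical vocabulary (`VOCAB`) and simply prefers the split with the fewest words, whereas a real library would rely on a large word list and word frequencies. The names `segment` and `fix_joined_words` are made up for this illustration.

```python
import re

# Hypothetical toy vocabulary; a real one would hold many thousands of words.
VOCAB = {"information", "otherwise", "have", "been", "report",
         "that", "explains", "the", "a", "in", "form"}


def segment(token, vocab=VOCAB):
    """Split token into the fewest vocabulary words, or return None if impossible."""
    n = len(token)
    max_len = max(map(len, vocab))
    # best[i] holds the shortest list of words covering token[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = token[start:end]
            if best[start] is not None and piece in vocab:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


def fix_joined_words(text, vocab=VOCAB):
    """Split only the unknown alphabetic tokens that can be fully segmented."""
    def repair(match):
        token = match.group(0)
        if token.lower() in vocab:
            return token  # already a known word, leave it alone
        parts = segment(token.lower(), vocab)
        # Note: a split token is emitted in lowercase; unsplittable tokens are kept as-is.
        return " ".join(parts) if parts else token
    return re.sub(r"[A-Za-z]+", repair, text)


print(fix_joined_words("the reportthatexplains informationotherwise"))
```

In practice the `segment` call could be swapped for `wordninja.split`, keeping the check that skips tokens the vocabulary already knows, so that correct words in the extracted text are never altered.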


qaiser



The wordsegment library also helps: https://grantjenks.com/docs/wordsegment/ – Martien Lubberink Oct 03 '22 at 00:21
