I am reading a badly formatted text, and often there are unwanted spaces inside a single word. For example, "int ernational trade is not good for economies" and so forth. Is there any efficient tool that can cope with this? (There are a couple of other answers like here, which do not work in a sentence.)

Edit: About the impossibility mentioned, I agree. One approach is to keep every possible candidate. In my case the edited text will be matched against another database that holds the original (clean) text, so any wrong removal of spaces simply fails to match and gets tossed away.

John s
  • I don't think this task is possible, even if you have a dictionary of valid words. For example, if you come across the string `in dent`, how do you know whether it's supposed to be two words as shown, or the single word `indent`? – dc-ddfe Feb 07 '23 at 00:24
  • Seems like a job for an NLP predictive text sort of solution — you’d decide which words to merge based on the probability of a given word following from the previous ones. – Samwise Feb 07 '23 at 00:28
  • See https://stackoverflow.com/questions/195010/how-can-i-split-multiple-joined-words which is kind of the opposite problem but I imagine the basic approach would be similar. – Samwise Feb 07 '23 at 00:30
  • You'll run into "what is a word?" because there are plenty of short words that are still valid if you remove the space between them, while also being completely wrong. E.g. you'd somehow have to know that "And then the ma called her kids" is not "And then thema called her kids", so you're looking at NLP ratings before and after, for n-grams to _try_ to rule out statistically unlikely word combinations. – Mike 'Pomax' Kamermans Feb 07 '23 at 00:30
  • Need NLP. You can't determine the significance of spaces as to how they relate to surrounding letters/numbers. Regex cannot match, or even come close to parsing language. – sln Feb 07 '23 at 01:12

1 Answer

You could use the PyEnchant package to check tokens against an English dictionary. I will assume that a fragment which is not a word on its own, but forms a word when joined with its neighbour, was split by mistake. The following code repairs words that are split by a single space:

```python
import enchant

text = "int ernational trade is not good for economies"
fixed_text = []

d = enchant.Dict("en_US")

for word in text.split():
    # Merge when the current token is not a word by itself
    # but forms one when glued to the previously kept token.
    if fixed_text and not d.check(word) and d.check(compound_word := fixed_text[-1] + word):
        fixed_text[-1] = compound_word
    else:
        fixed_text.append(word)

print(' '.join(fixed_text))
```

This splits the text on whitespace and appends the tokens to fixed_text one at a time. When the current token is not in the dictionary, but joining it to the previously added token does produce a valid word, it sticks the two together.

This should repair most of the split words, but as the comments mention, it is sometimes impossible to tell whether two tokens belong together without performing some sort of lexical analysis.
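
To try the merging rule without PyEnchant installed, here is a minimal sketch that swaps the dictionary for a small hand-made word set. `KNOWN_WORDS`, `is_word` and `merge_split_words` are names invented for this illustration, and the word list is made up; with PyEnchant you would pass `d.check` instead. It also shows the ambiguous case from the comments: `in dent` is left alone because both halves are valid words.

```python
# Stand-in for enchant.Dict("en_US").check; this word list is an assumption
# made up for the demo and is nowhere near a real dictionary.
KNOWN_WORDS = {
    "international", "trade", "is", "not", "good", "for", "economies",
    "in", "dent", "indent", "the", "line",
}


def is_word(token):
    """Return True if the token is in the stand-in word list."""
    return token.lower() in KNOWN_WORDS


def merge_split_words(text, check=is_word):
    """Join a token to the previous one when it is not a word by itself
    but the concatenation is."""
    fixed = []
    for word in text.split():
        if fixed and not check(word) and check(fixed[-1] + word):
            fixed[-1] += word
        else:
            fixed.append(word)
    return " ".join(fixed)


print(merge_split_words("int ernational trade is not good for economies"))
print(merge_split_words("in dent the line"))  # both halves are words, so no merge
```

Passing the check function as a parameter keeps the rule independent of any particular dictionary backend.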

As suggested by Pranav Hosangadi, here is a modified (and a little more involved) version which can remove multiple spaces within a word by compounding previously added tokens that are not in the dictionary. However, since a lot of short strings are valid English words, many spaced-out words still don't concatenate correctly.

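```python
import enchant

text = "inte rnatio nal trade is not good for ec onom ies"
fixed_text = []

d = enchant.Dict("en_US")

for word in text.split():
    if fixed_text and not d.check(compound_word := word):
        # Walk backwards over the kept tokens, prepending each one that is
        # not a dictionary word until the accumulated string becomes valid.
        # Caveat: valid tokens are skipped, but `del` also removes them
        # if a match is found further back.
        for j, pending_word in enumerate(fixed_text[::-1], 1):
            if not d.check(pending_word) and d.check(compound_word := pending_word + compound_word):
                del fixed_text[-j:]
                fixed_text.append(compound_word)
                break
        else:
            # No combination formed a valid word; keep the token unchanged.
            fixed_text.append(word)
    else:
        fixed_text.append(word)

print(' '.join(fixed_text))
```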
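
Since the question's edit says the repaired text is later matched against a database of clean text, and a comment adds that almost all entries contain only one stray space, another option is to skip the dictionary entirely: generate every variant with exactly one space removed and keep whichever variant appears in the clean data. The sketch below assumes the clean texts fit in an in-memory set; `single_space_removals`, `match_clean` and the sample `clean_texts` are hypothetical names and data for illustration only.

```python
def single_space_removals(text):
    """Yield the whitespace-normalized text, then every variant that has
    exactly one space between adjacent tokens removed."""
    words = text.split()
    yield " ".join(words)
    for i in range(len(words) - 1):
        merged = words[:i] + [words[i] + words[i + 1]] + words[i + 2:]
        yield " ".join(merged)


def match_clean(text, clean_texts):
    """Return the first candidate found in clean_texts, or None."""
    for candidate in single_space_removals(text):
        if candidate in clean_texts:
            return candidate
    return None


# Hypothetical clean database, held as a set for fast membership tests.
clean_texts = {
    "international trade is not good for economies",
    "indent the line",
}

print(match_clean("int ernational trade is not good for economies", clean_texts))
print(match_clean("in dent the line", clean_texts))
print(match_clean("inte rnatio nal trade", clean_texts))  # two stray spaces: no match
```

This produces one candidate per space, so the work grows linearly with sentence length, and entries with more than one stray space simply fail to match and can be discarded.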
B Remmelzwaal
  • There doesn't seem to be any guarantee that a word will only be split in two – Pranav Hosangadi Feb 07 '23 at 01:07
  • The OP didn't specify clearly as their example only provided one excess space in one word, and I prefaced the code. However, not a lot has to change to allow for any number of spaces. – B Remmelzwaal Feb 07 '23 at 01:09
  • This replaces any whitespace with single spaces. [Regex idea](https://tio.run/##PU7LCsIwELznK8ZemlgRHxcRih9iekhs1ECTLduI@vU1UXAOOzPMzrLjO90p7g8jz7MPI3ECOyGSeyW0qHxMcBxN8hTNgMSmd/ATIiXciHpcieEuFCl4N1X/Irv19LBSIINrqa1@NkrqKY9TW7S29eqbDibY3iAcEc7bDk2mXYclZLHnTYdFi9rX6rddzgslxMj5M1mcmucP) where "correct" whitespace stays what it is. I used a fake removal criterion, can't test with `enchant`. – Kelly Bundy Feb 07 '23 at 01:38
  • (Hmm, just realized my suggestion also combines more than two consecutive words if just each consecutive pair passes the check. Not sure that's good or bad. Oh well, maybe still useful anyway.) – Kelly Bundy Feb 07 '23 at 01:41
  • @BRemmelzwaal, in almost all cases, there is only one white space. There are a few entries with multiple spaces, which I'm happy to throw away. – John s Feb 07 '23 at 04:06