I have a few thousand badly parsed text files that show some interesting behavior for somewhere between 10% and 30% of their length. Unfortunately, I do not have the original data, so I can't attempt to re-parse it, but pretty much every file needs to be (partially) cleaned.
Example Input
text = 'The European l a n g u a g es ar e members of the same fa m i l y
. Their sep a rate e xi ste nce is a myth . F or s c i e n c e , music,
sport , etc, Europe uses the s a m e v oca bula ry. The languages o n l y d
i f f e r i n t heir grammar, their pro nu n c iation and their most common
words. Everyone realizes why a new common language would be desirable: one could
refuse to pay expensive translators.'
Expected Output
'The European languages are members of the same family. Their separate existence
is a myth. For science, music, sport, etc, Europe uses the same vocabulary. The
languages only differ in their grammar, their pronunciation and their most
common words. Everyone realizes why a new common language would be desirable:
one could refuse to pay expensive translators.'
There does not seem to be much regularity from one weird formatting to another, and there are no clear "cause" or trigger words or symbols. I noticed just one thing: the strangely formatted words are separated by two spaces (except sometimes before punctuation, but that is a simple text.replace(' ,', ',')).
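The punctuation cleanup mentioned above can be generalized beyond commas. The following is only a minimal sketch: the function name and the set of punctuation marks are assumptions for illustration, and it would also touch legitimate cases such as a space before a decimal like " .5".

```python
import re

def fix_space_before_punctuation(text):
    """Remove spaces that directly precede a punctuation mark.

    Assumption: only these marks matter: , . ; : ! ?
    """
    # ' +' matches one or more spaces; the captured mark is kept as-is.
    return re.sub(r' +([,.;:!?])', r'\1', text)

print(fix_space_before_punctuation('is a myth . For science , music,'))
```

Unlike a plain text.replace(' ,', ','), this handles several punctuation marks and more than one preceding space in a single pass.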
Question
How do I remove all the spaces from a string that are bracketed between pairs of double spaces? I assume there is a regex that I just haven't thought about...
Some more Info
I do not know how many of these weird parts/letters there are per document, and I do not know the content of the documents. The only other things I am reasonably certain of are:
- the shortest fragment length is 1 character ("members" could be "m e m b e r s"), but fragments may be much longer (such as in "anticip ated")
- punctuation may be preceded by a single space, but this is not always the case
I have tried creating a regex to use with re.sub(), but I have not gotten anywhere: I have neither a working match (my latest attempt was (?<= )[a-z]* (.* [a-z]*)(?= ), which does not work) nor a substitution group.
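To make the intended substitution concrete, here is a minimal sketch of the literal reading of the question. It assumes that a garbled word has exactly two spaces on each side and only single spaces between its fragments. The names GARBLED and clean are hypothetical, and the sample string is constructed for illustration rather than taken from the real files.

```python
import re

# Assumption: a garbled word sits between two double spaces and consists of
# at least two fragments separated by single spaces.
GARBLED = re.compile(r'(?<=  )\S+(?: \S+)+(?=  )')

def clean(text):
    """Join fragments between double spaces, then normalize spacing."""
    # A function as the replacement lets us strip the spaces inside each match.
    joined = GARBLED.sub(lambda m: m.group(0).replace(' ', ''), text)
    # Collapse the remaining runs of two or more spaces into one.
    return re.sub(r' {2,}', ' ', joined)

sample = 'The European  l a n g u a g es  ar e  members'
print(clean(sample))
```

The lookbehind and lookahead do not consume the double spaces, so consecutive garbled words can each match. The limitation is that this cannot tell a garbled word from a run of normal words that happens to sit between two double spaces; such a run would be merged into one word as well.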
Thank you!