
I have a few thousand badly parsed text files that show some interesting behavior for somewhere between 10% and 30% of their length. Unfortunately, I do not have the original data, so I can't attempt to re-parse, but pretty much every file needs to be (partially) cleaned.

Example Input


text = 'The European  l a n g u a g es  ar e  members  of  the  same  fa m i l y 
. Their  sep a rate  e xi ste nce  is a myth .  F or  s c i e n c e , music, 
sport , etc, Europe uses the  s a m e  v oca bula ry. The languages  o n l y  d 
i f f e r  i n  t heir  grammar, their  pro nu n c iation  and their most common 
words. Everyone realizes why a new common language would be desirable: one could 
refuse to pay expensive translators.'


Expected Output


'The European languages are members of the same family. Their separate existence 
 i s  a myth. For science, music, sport, etc, Europe uses the same vocabulary. The 
languages only differ in their grammar, their pronunciation and their most 
common words. Everyone realizes why a new common language would be desirable: 
one could refuse to pay expensive translators.'


There does not seem to be much regularity from one weird formatting to another, and no clear "cause" or trigger words or symbols. Just one thing I noticed: The words in strange formatting are separated by two spaces (except sometimes before punctuation, but that is a simple text.replace(' ,',',')).

Question

How do I remove all the spaces from a string that are bracketed between pairs of double spaces? I assume there is a regex that I just haven't thought about...


Some more Info

I do not know how many of these weird parts/letters there are per document, and I do not know the content of the documents. The only other things I am reasonably certain of are:

  • shortest fragment length is 1 character ("members" could be "m e m b e r s") and may be much longer (such as in "anticip ated")
  • punctuation may be preceded by a single space, but this is not always the case

I have tried creating a regex to use with re.sub(), but I have not gotten anywhere: I have neither a working match (latest attempt was (?<= )[a-z]* (.* [a-z]*)(?= ), which does not work) nor a substitution group.

Thank you!

Wiktor Stribiżew
jhl
  • Why do you wish for `existence i s a myth`? – MonkeyZeus Aug 29 '19 at 19:08
  • You may try: `re.sub(r'\s{2}(\w+(?:\s\w+)*(?=\s\W|$))', lambda m: ' ' + m.group(1).replace(' ', ''), text)` – anubhava Aug 29 '19 at 21:49
  • "Between two double spaces" is tricky. Take for example `...r i n t heir grammar, their p...`. Here, `grammar, their` is _also_ between two double-spaces! (Seems like inline code-highlight does not show double spaces, but you can see the part in your original text.) – tobias_k Aug 30 '19 at 15:28

2 Answers


If there's no pattern, here are some suggestions:

  1. Replace every run of more than one space with a single space.
  2. Then check each word against a dictionary, e.g. myDictionary.exists(word).
  3. The odd spaces may have marked the beginning or end of text formatting. Check the Unicode code point of the space characters.
  4. Attempt to get the original again, or contact the author who is sending you the text.

In suggestion 2, check whether each fragment is a word. If it is not, append the next fragment and check again. Keep doing that until you find a word. It won't work for every word, but "l a n g u a g es" will turn into "languages", apart from the false matches "la" and "lan" on the way. So even if you find a word, keep adding characters until the result is a word again or you reach a limit of around 16 characters.

In pseudo code:

replace every run of more than one space with a single space
split the string into an array on single spaces
for each word in the array:
    check whether the word exists in the English language
    if not, append the following fragments until you get a match
    move to the next word
for punctuation: if a punctuation character is at the beginning of a word or sits between two spaces, remove the preceding space character

How to check if a word is an English word with Python?
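The pseudocode above might be sketched in Python roughly as follows. This is only an illustration under assumptions: `merge_fragments` and `VOCABULARY` are hypothetical names, and the toy word set stands in for whatever real English word list you end up using. Note that `str.split()` also swallows line breaks, so the result comes back as a single line.

```python
import string

# Hypothetical toy vocabulary; a real run would load a full word list.
# "la" and "lan" are included to show that short false matches do not
# stop the search early.
VOCABULARY = {"the", "european", "languages", "are", "members",
              "la", "lan"}


def merge_fragments(text, vocabulary, max_len=16):
    """Greedily join neighbouring fragments into the longest known word."""
    tokens = text.split()  # splits on any run of whitespace
    merged = []
    i = 0
    while i < len(tokens):
        best_end = i + 1  # fall back to keeping the fragment unchanged
        candidate = ""
        for j in range(i, len(tokens)):
            candidate += tokens[j]
            if len(candidate) > max_len:
                break
            # Ignore surrounding punctuation and case for the lookup.
            if candidate.strip(string.punctuation).lower() in vocabulary:
                best_end = j + 1  # remember the longest match so far
        merged.append("".join(tokens[i:best_end]))
        i = best_end
    return " ".join(merged)


sample = "The European  l a n g u a g es  ar e  members"
print(merge_fragments(sample, VOCABULARY))
# The European languages are members
```

Because the search always prefers the longest match, it can also glue together correctly separated neighbours that happen to form a longer word ("a" followed by "part" becomes "apart"), so the output still needs a manual check.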

Calculon

I would do it in three steps (five if you include the optional ones):

  1. First, text.replace('   *','(@)') (three spaces before the asterisk, so the pattern matches two or more spaces). This converts every run of two or more spaces into a token you can be sure will not appear in the text (I used (@) as an example), as shown in demo1. It prevents runs of two or more spaces from being treated as sequences of single spaces, which the next step erases.
  2. Next, text.replace(' ',''). This converts every single space into the empty string, as seen in demo2. Be careful: it also joins the many words in your sample text that are legitimately separated by a single space.
  3. Finally, text.replace('\(@\)',' '). This converts all the tokens from the first step into single spaces, as in demo3.
  4. [optional] text.replace(' *([.!?]) *([A-Z])','$1  $2'). If you also convert every sentence-ending mark followed by an uppercase character into the mark, two spaces, and the matched uppercase character, the result looks nicer, as in demo4.
  5. [optional] text.replace(' *([,;:]) *','$1 '). Do the same with the other punctuation symbols, but with only one space.
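For Python readers, steps 1 to 3 could look something like the sketch below. It assumes the patterns above are meant as regular expressions, so it uses `re.sub` for the first step (Python's `str.replace` only does literal replacement). The function name `join_within_double_spaces` is made up for illustration, and the `(@)` token is assumed never to occur in the text.

```python
import re


def join_within_double_spaces(text, token="(@)"):
    """Apply steps 1 to 3; `token` is assumed never to occur in `text`."""
    marked = re.sub(r" {2,}", token, text)   # step 1: protect runs of 2+ spaces
    squeezed = marked.replace(" ", "")       # step 2: drop every single space
    return squeezed.replace(token, " ")      # step 3: token -> one space


sample = "The European  l a n g u a g es  ar e  members  of"
print(join_within_double_spaces(sample))
# TheEuropean languages are members of
```

The output shows both effects at once: the broken words are repaired, but "The European", which was correctly separated by a single space, is fused as well.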

You can do this with sed(1) as in:

$ sed -e 's/   */#@#/g' \
      -e 's/ //g' \
      -e 's/#@#/ /g' \
      -e 's/ *\([.!?]\)  *\([A-Z]\)/\1  \2/g' \
      -e 's/ *\([,;:]\) */\1 /g' \
      <<EOF
The European  l a n g u a g es  ar e  members  of
the  same  fa m i l y . Their  sep a rate  e xi ste nce
is a myth .  F or  s c i e n c e , music, sport ,
etc, Europe uses the  s a m e  v oca bula ry. The
languages  o n l y  d i f f e r  i n  t heir
grammar, their  pro nu n c iation  and their most
common words. Everyone realizes why a new common
language would be desirable: one could 
refuse to pay expensive translators.
EOF
TheEuropean languages are members of
the same family.  Their separate existence
isamyth. For science, music, sport,
etc, Europeusesthe same vocabulary.  The
languages only differ in their
grammar, their pronunciation andtheirmost
commonwords. Everyonerealizeswhyanewcommon
languagewouldbedesirable: onecould
refusetopayexpensivetranslators.
$ _

The last example also converts each of [,;:] into the same character plus a space, and applies the sentence separation to ? and ! marks as well.

How do I remove all the spaces from a string that are bracketed between pairs of double spaces?

Don't treat runs of n spaces between two fragments as a special case: n spaces is the same as two or more. Simply use text.replace('   *','  ') (three spaces before the *, two spaces in the replacement) to replace a run of two or more spaces with a run of exactly two. The same can be achieved with text.replace('  +','  ') (two spaces before the +).
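In Python this normalisation might be written as below; it is a small sketch that again assumes the pattern is a regular expression and uses `re.sub`, with `normalize_gaps` as a hypothetical helper name.

```python
import re


def normalize_gaps(text):
    """Replace every run of two or more spaces with exactly two spaces."""
    return re.sub(r" {2,}", "  ", text)


print(repr(normalize_gaps("fa m i l y   .    Their  sep")))
# 'fa m i l y  .  Their  sep'
```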

Luis Colorado
  • Thanks Luis, I appreciate the effort! And while that technically solves the problem, it introduces a new one. So while the part of text that was previously broken is now fixed, it breaks the other parts that were previously ok. I think the replacement of two spaces by another token is a good idea, but I'm having difficulty translating the pseudocode "replace all spaces between two tokens" to code. – jhl Aug 31 '19 at 10:12
  • yes, but it does as you specified... it's impossible to know beforehand which _single spaces_ must be preserved and which must be erased. You have both. At least it reduces the amount of text to be manually checked. – Luis Colorado Sep 02 '19 at 19:03
  • The `replace` commands follow your own examples... I don't know python, so I suggested you to do it with sed, which is a standard unix command. – Luis Colorado Sep 02 '19 at 19:05