1

I'm importing from a .txt file containing some David Foster Wallace that I copy-pasted from a PDF. Some words ran off the page and so come in the form of

"interr- upted"

I was going to sanitize it by using something like:

with open(text, "r") as bookFile:
    bookString = bookFile.read().replace("- ", "")

Except... the man also uses some weird constructions in his writing. Things like:

"R - - d©"

for the brand name bug spray Raid©. I'm left with "R d©" obviously, but is there a way to make it .replace() instances of "- " but not instances of " - "? Or do I need to turn everything into lists and do operations to everything that way? Thanks.

martineau
Luke McPuke
  • How would you define this condition? Is it only if there are one or more letters, hyphen, space, one or more letters? – jacoblaw Jul 01 '17 at 16:34
  • Good point. I want the case to be more general, so that when I do the same to future books any instance of "a - - b" won't get thrown away, but similar run-off words in the form of "ab- c" will get turned into "abc". – Luke McPuke Jul 01 '17 at 17:02

3 Answers

3

You could use a regular expression with a negative lookbehind assertion to check the previous character, and re.sub to replace matches with an empty string.

'(?<! )- ' is a regular expression that matches every instance of '- ' that is not immediately preceded by a space character (see the re module documentation for the lookbehind syntax). re.sub('(?<! )- ', '', input_string) replaces every occurrence of that pattern in input_string with '' (the empty string) and returns the result.

Examples:

In [1]: import re

In [2]: re.sub('(?<! )- ', '', 'interr- upted')
Out[2]: 'interrupted'

In [3]: re.sub('(?<! )- ', '', 'R - - d©')
Out[3]: 'R - - d©'
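For applying this to a whole book rather than single words, the same call can be wrapped in a small helper. This is only a sketch: the names DEHYPHENATE and join_broken_words and the sample sentence are made up for illustration and are not from the question.

```python
import re

# "- " that is NOT immediately preceded by a space.
# (?<! ) is a negative lookbehind: it checks the previous character
# without consuming it, so only the hyphen and the space get removed.
DEHYPHENATE = re.compile(r"(?<! )- ")


def join_broken_words(text):
    """Remove the 'hyphen + space' left behind by line-break hyphenation."""
    return DEHYPHENATE.sub("", text)


sample = "He was interr- upted while spraying R - - d© on the ants."
print(join_broken_words(sample))
# He was interrupted while spraying R - - d© on the ants.
```

One thing to be aware of: the lookbehind also succeeds at the very start of the string, so text that begins with "- " would lose those two characters as well.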
vaultah
  • This worked perfectly - my input was just the entire text file as one string, and it truncated every instance of "- " without destroying "R - - d©" or "f - - k" or any regular hyphenated words. Is there any chance you could explain what's going on in the re.sub() args you chose? The documentation is a little confusing, never used regular expressions before. – Luke McPuke Jul 01 '17 at 17:15
  • @LukeMcPuke I tried to explain it better; check the updated answer. For the complete explanation of that regular expression, see the linked documentation and [this](https://regex101.com/r/5HNEha/1) page – vaultah Jul 01 '17 at 18:11
2

You can use lookbehinds and lookaheads to make sure you substitute only the occurrences that need to be substituted:

>>> import re
>>> regex_pattern = '(?<=[a-z])(- )(?=[a-z])'
>>> re.sub(regex_pattern, '', "interr- upted", flags=re.I)
'interrupted'

And,

>>> re.sub(regex_pattern, '', "R - - d©")
'R - - d©'

The latter is not affected.
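Two details of this approach are easy to miss, so here is a hedged sketch of both. The fourth positional parameter of re.sub is count, so the case-insensitive flag has to be passed as flags=...; and requiring a letter on both sides is stricter than only requiring "no space before the hyphen". The constant names and the "pages 5- 6" input are invented for illustration.

```python
import re

# Hyphen + space with a letter on both sides (the pattern from this answer,
# written without the capture group, which re.sub does not need here).
LETTERS_ONLY = r"(?<=[a-z])- (?=[a-z])"
# A looser pattern: hyphen + space that is not preceded by a space.
NOT_AFTER_SPACE = r"(?<! )- "

# flags must be passed by keyword; the fourth positional argument is count.
upper_fixed = re.sub(LETTERS_ONLY, "", "INTERR- UPTED", flags=re.IGNORECASE)
upper_missed = re.sub(LETTERS_ONLY, "", "INTERR- UPTED")

# The two patterns disagree when the neighbours are not letters.
strict = re.sub(LETTERS_ONLY, "", "pages 5- 6", flags=re.IGNORECASE)
loose = re.sub(NOT_AFTER_SPACE, "", "pages 5- 6")

print(upper_fixed)   # INTERRUPTED
print(upper_missed)  # INTERR- UPTED
print(strict)        # pages 5- 6
print(loose)         # pages 56
```

Which behaviour is preferable depends on the text; for prose copied from a PDF, leaving digit ranges alone is probably the safer default.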

cs95
-2

Is this what you need?

In [23]: import re
In [24]: re.sub(r'- ', '', '"R - - d"')
Out[24]: '"R d"'

This link can help you.

HTH