Is there a way to .replace() certain string snippets according to a criterion?

Question

I'm importing from a .txt file containing some David Foster Wallace that I copy-pasted from a PDF. Some words ran off the page and so come in the form of

"interr- upted"

I was going to sanitize it by using something like:

with open(text, "r") as bookFile:
    bookString = bookFile.read().replace("- ", "")

Except... the man also uses some weird constructions in his writing. Things like:

"R - - d©"

for the brand name bug spray Raid©. I'm left with "R d©" obviously, but is there a way to make it .replace() instances of "- " but not instances of " - "? Or do I need to turn everything into lists and do operations to everything that way? Thanks.

How would you define this condition? Is it only if there are one or more letters, hyphen, space, one or more letters? — jacoblaw, Jul 01 '17 at 16:34
Good point. I want the case to be more general, so that when I do the same to future books any instance of "a - - b" won't get thrown away, but similar run-off words in the form of "ab- c" will get turned into "abc". — Luke McPuke, Jul 01 '17 at 17:02

vaultah · Accepted Answer · 2017-07-01T18:06:42.477

3

You could use a regular expression with a negative lookbehind assertion to check the previous character, and re.sub to replace matches with an empty string.

'(?<! )- ' is a regular expression that matches every instance of '- ' not immediately preceded by a space character (see the "Regular Expression Syntax" section of the re module documentation for the syntax). re.sub('(?<! )- ', '', input_string) replaces all occurrences of that pattern in input_string with '' (the empty string) and returns the result.

Examples:

In [1]: import re

In [2]: re.sub('(?<! )- ', '', 'interr- upted')
Out[2]: 'interrupted'

In [3]: re.sub('(?<! )- ', '', 'R - - d©')
Out[3]: 'R - - d©'
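
If you want to apply this to a whole book, here is a minimal sketch of how the pattern could be wrapped up. The names BROKEN_WORD, join_broken_words and clean_file are made up for illustration, and the sketch assumes the file is UTF-8 and that the page breaks ended up as "- " (hyphen plus space) rather than a hyphen followed by a newline.

```python
import re

# Matches "- " only when the hyphen is NOT directly preceded by a space.
# (Hypothetical name; the pattern itself is the one from the answer.)
BROKEN_WORD = re.compile(r'(?<! )- ')


def join_broken_words(text):
    """Remove every "- " whose hyphen does not follow a space."""
    return BROKEN_WORD.sub('', text)


def clean_file(path):
    """Read a whole text file and return it with broken words rejoined.

    Assumes UTF-8; change the encoding if your file uses something else.
    """
    with open(path, encoding='utf-8') as book_file:
        return join_broken_words(book_file.read())


print(join_broken_words('interr- upted and R - - d©'))
# prints: interrupted and R - - d©
```

Compiling the pattern once is only a convenience when the same substitution is applied to several files; calling re.sub directly, as in the examples above, behaves the same way.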

edited Jul 01 '17 at 18:06

answered Jul 01 '17 at 16:33


This worked perfectly - my input was just the entire text file as one string, and it truncated every instance of "- " without destroying "R - - d©" or "f - - k" or any regular hyphenated words. Is there any chance you could explain what's going on in the re.sub() args you chose? The documentation is a little confusing, never used regular expressions before. – Luke McPuke Jul 01 '17 at 17:15

@LukeMcPuke I tried to explain it better, check the updated answer. For the complete explanation of that regular expression see the linked documentation and [this](https://regex101.com/r/5HNEha/1) page – vaultah Jul 01 '17 at 18:11

score 2 · Answer 2 · answered Jul 01 '17 at 16:38

You can use lookbehinds and lookaheads to make sure you substitute only the occurrences that need to be substituted:

>>> import re
>>> regex_pattern = '(?<=[a-z])(- )(?=[a-z])'
>>> re.sub(regex_pattern, '', "interr- upted", flags=re.I)
'interrupted'

And,

>>> re.sub(regex_pattern, '', "R - - d©")
'R - - d©'

The latter is not affected.
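
To see where requiring a letter on both sides makes a difference, here is a small sketch comparing this pattern with a plain "not preceded by a space" lookbehind. The names LOOSE, STRICT, loose and strict and the sample strings are invented for the comparison, so treat it as an illustration rather than a recommendation.

```python
import re

# "- " not preceded by a space: joins anything that runs into the hyphen.
LOOSE = re.compile(r'(?<! )- ')

# "- " only between two ASCII letters. The flag is passed by keyword because
# the fourth positional argument of re.sub() is count, not flags.
STRICT = re.compile(r'(?<=[a-z])- (?=[a-z])', flags=re.IGNORECASE)


def loose(text):
    return LOOSE.sub('', text)


def strict(text):
    return STRICT.sub('', text)


for sample in ['interr- upted', 'INTERR- UPTED', 'R - - d©', '1990- 1995']:
    print(repr(sample), '->', repr(loose(sample)), '|', repr(strict(sample)))
```

Both versions rejoin "interr- upted" and leave "R - - d©" alone. They differ on input such as "1990- 1995": the looser pattern glues the digits together, while the letters-only pattern leaves the string untouched.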

score -2 · Answer 3 · answered Jul 01 '17 at 16:42


Is this what you need?

In [23]: import re
In [24]: re.sub(r'- ', '', '"R - - d"')
Out[24]: '"R d"'

This link can help you.

HTH

Fabio Xanti


No, OP very clearly mentioned this is not what they need. – cs95 Jul 01 '17 at 16:43
