10

I have a long string which is a paragraph, however there is no white space after periods. For example:

para = "I saw this film about 20 years ago and remember it as being particularly nasty. I believe it is based on a true incident: a young man breaks into a nurses\' home and rapes, tortures and kills various women.It is in black and white but saves the colour for one shocking shot.At the end the film seems to be trying to make some political statement but it just comes across as confused and obscene.Avoid."

I am trying to use re.sub to solve this problem, but the output is not what I expected.

This is what I did:

re.sub("(?<=\.).", " \1", para)

I am matching the first character of each sentence, and I want to put a white space before it. My match pattern is (?<=\.)., which (supposedly) matches any character that appears after a period. I learned from other Stack Overflow questions that \1 refers to the last matched pattern, so I wrote my replacement pattern as " \1": a space followed by the previously matched string.

Here is the output:

"I saw this film about 20 years ago and remember it as being particularly nasty. \x01I believe it is based on a true incident: a young man breaks into a nurses\' home and rapes, tortures and kills various women. \x01t is in black and white but saves the colour for one shocking shot. \x01t the end the film seems to be trying to make some political statement but it just comes across as confused and obscene. \x01void. \x01

Instead of matching any character preceded by a period and adding a space before it, re.sub replaced the matched character with \x01. Why? How do I add a character before a matched string?

Seanny123
versatile parsley
  • http://stackoverflow.com/a/12597709/2850543 – Millie Smith Mar 11 '17 at 06:14
  • I think the question should say *"however there is no white space after **some** periods."* as there is white space after the first period? – Sash Sinha Mar 11 '17 at 06:40
  • @shash678 yes, you are right. I didn't say that because in my case it's okay to have multiple white spaces and I didn't want to make the question complicated – versatile parsley Mar 11 '17 at 07:50
  • There's always the cheap answer: `text = text.replace(".", ". ").replace(". " + " ", ". ")` (the string concatenation is there because Stack Exchange is eating the double space). Basically, replace each period with period + space, then replace each period + space + space with period + single space. No regexes required, and you don't have to import anything. – Fake Name Mar 11 '17 at 09:29
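
The replace-only idea from the last comment can be sketched as follows. This is a minimal illustration, not a tested recommendation: the helper name and the sample string are made up, and the double space is written literally instead of via concatenation.

```python
def add_spaces_without_regex(text):
    # Hypothetical helper: put a space after every period, then collapse the
    # doubled space created wherever a space was already present.
    return text.replace(".", ". ").replace(".  ", ". ")

sample = "One.Two. Three."
result = add_spaces_without_regex(sample)
print(repr(result))
```

Note that this approach also appends a space after a final period and splits an ellipsis such as "..." into ". . . ", so it probably only suits input like the paragraph in the question.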

5 Answers

9

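The (?<=a)b construct is a positive lookbehind: it matches b only when it follows a, and the a is not captured. So in your expression, \1 cannot refer to what's inside (?<=...), and the pattern has no capturing group at all. On top of that, in a non-raw Python string literal "\1" is the octal escape for the character \x01, which is why that character appears in your output.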
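
To make the \x01 in the output concrete, here is a minimal sketch, assuming Python 3. It shows that "\1" in an ordinary string literal is just chr(1), that a raw r"\1" is rejected because the pattern has no capturing group, and that \g<0> can stand in for the whole match. The sample string is made up for illustration.

```python
import re

text = "one.two.three"  # made-up sample, not the paragraph from the question

# In a normal (non-raw) string literal, "\1" is an octal escape for chr(1).
assert " \1" == " \x01"

# So this replaces each matched character with a space plus chr(1).
broken = re.sub(r"(?<=\.).", " \1", text)

# A raw string passes the backreference through to re, but the pattern has
# no capturing group (a lookbehind does not capture), so re rejects it.
try:
    re.sub(r"(?<=\.).", r" \1", text)
    raw_error = None
except re.error as exc:
    raw_error = exc

# \g<0> stands for the entire match, so no capturing group is needed.
fixed = re.sub(r"(?<=\.).", r" \g<0>", text)

print(repr(broken))
print(raw_error)
print(fixed)
```

The last call keeps the original pattern and only changes the replacement, so it still adds a space when one is already present.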

Your current approach has another flaw: it would add a space after a . even when one is already there.

To add a missing space after a ., I suggest a different strategy: replace each . that is followed by a character other than a space or another . with a . and a space:

re.sub(r'\.(?=[^ .])', '. ', para)
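
As a quick check of that substitution, here is a small sketch. The helper name and the sample string are made up for illustration; the regex is the one above.

```python
import re

def add_space_after_periods(text):
    """Insert a space after each '.' that is followed by a character
    other than a space or another '.' (hypothetical helper name)."""
    return re.sub(r"\.(?=[^ .])", ". ", text)

sample = "Wait...what.It is fine. Already spaced.Avoid."
result = add_space_after_periods(sample)
print(result)
```

Because the lookahead needs a following character, the final period is left alone, and runs of periods only get a space after the last one. One caveat: a decimal such as "3.14" would become "3. 14", which may or may not matter for your text.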
janos
2

You may perhaps use the following regex (with a positive look-behind and negative look-ahead assertion):

(?<=\.)(?!\s)

python

re.sub(r"(?<=\.)(?!\s)", " ", para)

see demo
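
One thing worth knowing about the zero-width version: the end of the string also satisfies (?!\s), so a space is appended after a final period. A minimal sketch with a made-up sample; the second pattern is my own variation (an assumption, not part of the answer) that requires a non-whitespace character to follow.

```python
import re

sample = "one.two. three."

# Zero-width match: the position right after a '.' that is not followed by
# whitespace. End-of-string also satisfies (?!\s).
with_trailing = re.sub(r"(?<=\.)(?!\s)", " ", sample)

# Variation (assumption): require a non-whitespace character to follow,
# which excludes the end of the string.
without_trailing = re.sub(r"(?<=\.)(?=\S)", " ", sample)

print(repr(with_trailing))
print(repr(without_trailing))
```

If a trailing space is harmless in your case, the original pattern is fine as it is.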

m87
2

A slightly modified version of your regex will also work:

print(re.sub(r"([\.])([^\s])", r"\1 \2", para))

# I saw this film about 20 years ago and remember it as being particularly nasty. I believe it is based on a true incident: a young man breaks into a nurses' home and rapes, tortures and kills various women. It is in black and white but saves the colour for one shocking shot. At the end the film seems to be trying to make some political statement but it just comes across as confused and obscene. Avoid.
Sandipan Dey
1

I think this is what you want to do. You can pass a function in to do the replacement.

import re

def my_replace(match):
    return " " + match.group()

my_string = "dhd.hd hd hs fjs.hello"
print(re.sub(r'(?<=\.).', my_replace, my_string))

Prints:

dhd. hd hd hs fjs. hello

As @Seanny123 pointed out, this will add a space even if there was already a space after the period.
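
If the extra space matters, the replacement function itself can check what was matched. A minimal sketch, assuming the same pattern as above; the callback name is hypothetical.

```python
import re

def space_unless_present(match):
    """Replacement callback (hypothetical name): prefix the matched character
    with a space unless it is already whitespace."""
    char = match.group()
    return char if char.isspace() else " " + char

my_string = "dhd.hd hd hs fjs. hello"
result = re.sub(r"(?<=\.).", space_unless_present, my_string)
print(result)
```

This keeps the callback approach but leaves existing spaces untouched.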

Millie Smith
  • This answer adds a space if there's already an existing space. – Seanny123 Mar 11 '17 at 06:31
  • @seanny123 OP states "there is no white space after periods". We can argue about requirements all day. I'm on a phone and am not going to be bothered to perfect this. Just don't upvote and move along, mate. – Millie Smith Mar 11 '17 at 06:35
  • Sorry about the tone and content. Was trying to be informative and messed up. My bad. – Seanny123 Mar 11 '17 at 06:36
  • @seanny123 nah mate you're good. You're right. I'm just tired and on my phone so it's hard to type out meaningful replies. I'll add a piece to my answer that adds your comment in since it's valuable information. – Millie Smith Mar 11 '17 at 06:38
  • @Seanny123 was correct, and this solution is valid because in this case I'm okay with multiple whitespaces. If we were to generalize this solution a bit, then we'd need to take care of extra whitespace. – versatile parsley Mar 11 '17 at 08:01
0

The simplest regex substitution you can use is this one:

re.sub(r'\.(?=\w)', '. ', para)

It simply matches each period, using the lookahead (?=\w) to make sure a word character comes next (so not a space that is already there), and replaces the period with a period followed by a space.
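
A small sketch of how the \w lookahead behaves on a few made-up inputs; the helper name is hypothetical and the regex is the one above.

```python
import re

def space_before_word(text):
    # Same substitution as in the answer, wrapped in a hypothetical helper.
    return re.sub(r"\.(?=\w)", ". ", text)

first = space_before_word("women.It is in black and white.Avoid.")
second = space_before_word('He said no."Why" she asked.')
third = space_before_word("version 3.14")

print(first)
print(second)
print(third)
```

Since \w covers only letters, digits and the underscore, a period followed by a quote or bracket is left unchanged, while a decimal like 3.14 does get split.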

micsthepick