
I am parsing a text file and want to remove all in-paragraph line breaks, while actually keeping the double line feeds that form new paragraphs. e.g.

This is my first poem\nthat does not make sense\nhow far should it go\nnobody can know.\n\nHere is a seconds\nthat is not as long\ngoodbye\n\n

When printed out, this should look like this:

This is my first poem
that does not make sense
how far should it go
nobody can know.

Here is a seconds
that is not as long
goodbye

should become

This is my first poem that does not make sense how far should it go nobody can know.\n\nHere is a seconds that is not as long goodbye\n\n

Again, when printed, it should look like:

This is my first poem that does not make sense how far should it go nobody can know.

Here is a seconds that is not as long goodbye

The trick here is removing single occurrences of '\n' while keeping the double line feed '\n\n', AND preserving white space (i.e. "hello\nworld" becomes "hello world" and not "helloworld").

I can do this by first substituting the \n\n with a dummy string (like "$$$", or something equally ridiculous), then removing the \n followed by reconversion of "$$$" back to \n\n...but that seems overly circuitous. Can I make this conversion with a single regular expression call?
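For reference, the placeholder workaround described above might look roughly like the sketch below. The function name `join_lines_with_placeholder` and the choice of "\x00" as the dummy string are illustrative assumptions, and the approach only works if the placeholder never occurs in the input.

```python
def join_lines_with_placeholder(text, placeholder="\x00"):
    # Assumption: the placeholder never appears in the text itself.
    # Step 1: protect paragraph breaks by swapping them for the placeholder.
    protected = text.replace("\n\n", placeholder)
    # Step 2: turn every remaining (single) newline into a space.
    joined = protected.replace("\n", " ")
    # Step 3: restore the paragraph breaks.
    return joined.replace(placeholder, "\n\n")


s = "This is my first poem\nthat does not make sense\n\nHere is a seconds\ngoodbye\n\n"
print(repr(join_lines_with_placeholder(s)))
```

This takes three passes over the string, which is the roundabout part a single regular expression would avoid.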

tnknepp

1 Answer


You can replace every newline that is neither preceded nor followed by another newline with a space:

re.sub(r"(?<!\n)\n(?!\n)", " ", s)

See the Python demo:

import re
s = "This is my first poem\nthat does not make sense\nhow far should it go\nnobody can know.\n\nHere is a seconds\nthat is not as long\ngoodbye\n\n"
res = re.sub(r"(?<!\n)\n(?!\n)", " ", s)
print(res)

Here, (?<!\n) is a negative lookbehind that fails the match if the newline is preceded by another newline, and (?!\n) is a negative lookahead that fails the match if the newline is followed by another newline.
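To see how the two lookarounds behave on edge cases, here is a small sketch that wraps the same pattern in a helper. The names `SINGLE_NEWLINE` and `unwrap` are made up for illustration, and the input is assumed to use plain "\n" line endings (not "\r\n").

```python
import re

# Same pattern as above, precompiled: a newline with no newline on either side.
SINGLE_NEWLINE = re.compile(r"(?<!\n)\n(?!\n)")


def unwrap(text):
    # Replace isolated newlines with a space; runs of 2+ newlines are untouched.
    return SINGLE_NEWLINE.sub(" ", text)


print(repr(unwrap("hello\nworld")))  # a lone newline becomes a space
print(repr(unwrap("a\n\nb")))        # a paragraph break is left alone
print(repr(unwrap("a\n\n\nb")))      # longer runs of newlines are left alone too
```

Because both assertions are zero-width, they only inspect the neighbouring characters without consuming them, so the surrounding text is never altered.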

See more about Lookahead and Lookbehind Zero-Length Assertions here.

Wiktor Stribiżew