
I want to remove the line breaks from a text that is wrapped to a certain width, e.g.:

import re
x = 'the meaning\nof life'
re.sub("([,\w])\n(\w)", "\1 \2", x)
'the meanin\x01 \x02f life'

I want to get back 'the meaning of life'. What am I doing wrong?

Karl Knechtel
geotheory
  • The problem is not with the regex, but with the replacement string, which still has to use Python string literal escapes for the backslash. Thus `"\\1 \\2"`, or `r"\1 \2"` but not `"\1 \2"`. – Karl Knechtel Aug 08 '22 at 02:49

2 Answers


You need to escape the backslash, like this:

>>> import re
>>> x = 'the meaning\nof life'

>>> re.sub("([,\w])\n(\w)", "\1 \2", x)
'the meanin\x01 \x02f life'

>>> re.sub("([,\w])\n(\w)", "\\1 \\2", x)
'the meaning of life'

>>> re.sub("([,\w])\n(\w)", r"\1 \2", x)
'the meaning of life'
>>>

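If you don't escape it, Python reads "\1" in an ordinary string literal as an octal escape for the character with code 1, so the regex engine never sees a backslash: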

>>> '\1'
'\x01'
>>> 
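To make the difference concrete, here is a small sketch comparing the three ways of writing the replacement string. The variable names are only illustrative.

```python
# Ordinary literal: "\1" and "\2" are octal escapes, so the string holds
# the control characters chr(1) and chr(2) with a space between them.
plain = "\1 \2"

# Doubled backslashes and a raw literal both keep the backslashes,
# which is what re.sub needs to see for group references.
doubled = "\\1 \\2"
raw = r"\1 \2"

plain_length = len(plain)   # 3 characters
raw_length = len(raw)       # 5 characters
same_text = doubled == raw  # True
```

Only the last two forms hand `\1 \2` to the regex engine intact.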

That's why you need '\\\\' or r'\\' to match a single literal \ in a Python regex.

On that point, from this answer:

If you're putting this in a string within a program, you may actually need to use four backslashes (because the string parser will remove two of them when "de-escaping" it for the string, and then the regex needs two for an escaped regex backslash).

And from the documentation:

As stated earlier, regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This conflicts with Python's usage of the same character for the same purpose in string literals.

Let's say you want to write a RE that matches the string \section, which might be found in a LaTeX file. To figure out what to write in the program code, start with the desired string to be matched. Next, you must escape any backslashes and other metacharacters by preceding them with a backslash, resulting in the string \\section. The resulting string that must be passed to re.compile() must be \\section. However, to express this as a Python string literal, both backslashes must be escaped again.
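The \section case from the quoted documentation can be sketched as follows. The sample text and variable names are made up for illustration.

```python
import re

# Sample text containing one literal backslash (raw literal, so it is kept).
text = r"see \section for details"

# Ordinary literal: four backslashes in the source become two in the string,
# which the regex engine reads as one escaped, literal backslash.
pattern_plain = "\\\\section"

# Raw literal: two backslashes in the source stay two in the string.
pattern_raw = r"\\section"

same_pattern = pattern_plain == pattern_raw
match = re.search(pattern_raw, text)
matched_text = match.group(0)
```

Both spellings produce the same nine-character pattern; the raw literal is just easier to read.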


Alternatively, as brittenb suggested, you don't need a regex in this case:

>>> x = 'the meaning\nof life'
>>> x.replace("\n", " ")
'the meaning of life'
>>> 
Remi Guan

Use raw string literals. Both Python's string literal syntax and the regex engine interpret backslashes: \1 in an ordinary Python string literal is read as an octal escape, but in a raw string literal it is left alone:

re.sub(r"([,\w])\n(\w)", r"\1 \2", x)

The alternative would be to double all backslashes so that they reach the regex engine as such.
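As a rough check of that alternative, the sketch below writes the pattern and the replacement both ways and compares them. The variable names are illustrative only.

```python
import re

x = "the meaning\nof life"

# Every backslash doubled inside ordinary string literals.
doubled_pattern = "([,\\w])\\n(\\w)"
doubled_replacement = "\\1 \\2"

# The same two strings written as raw literals.
raw_pattern = r"([,\w])\n(\w)"
raw_replacement = r"\1 \2"

result_doubled = re.sub(doubled_pattern, doubled_replacement, x)
result_raw = re.sub(raw_pattern, raw_replacement, x)
```

The two spellings are the same strings, so they give the same result; raw literals simply need fewer backslashes.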

See the Backslash plague section of the Python regex HOWTO.

Demo:

>>> import re
>>> x = 'the meaning\nof life'
>>> re.sub(r"([,\w])\n(\w)", r"\1 \2", x)
'the meaning of life'

It might be easier just to split on newlines; use the str.splitlines() method, then re-join with spaces using str.join():

' '.join(x.splitlines())

but admittedly this won't distinguish between newlines between words and extra newlines elsewhere.
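A small sketch of that difference, using a made-up sample with a blank line between two paragraphs:

```python
import re

# Hypothetical wrapped text; the empty line marks a paragraph break.
wrapped = "the meaning\nof life,\nthe universe\n\nand everything"

# The regex only joins a newline that sits directly between a word
# character (or comma) and another word character, so the blank line stays.
joined_regex = re.sub(r"([,\w])\n(\w)", r"\1 \2", wrapped)

# splitlines/join replaces every line break, including the paragraph break,
# which leaves two consecutive spaces where the blank line was.
joined_split = " ".join(wrapped.splitlines())
```

Here `joined_regex` keeps the paragraph break, while `joined_split` flattens everything onto one line.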

Martijn Pieters