New line symbols are not detected by regex

Question

I import data from csv and read it with pandas:

train = pd.read_csv('yelp_review_full_csv/train.csv',
                    header=None,
                    names=['Class', 'Review'])

reviews = train['Review']

and I want to get rid of the newline symbols (\n) using a regex:

print(reviews[3])
rex = re.sub("\\n+", " ", reviews[3])
print(rex)

which gives me this output:

... much.  \n\nI think ...
... much.  \n\nI think ...

If I copy the output and test it against the regex, I get the desired result. I guess the problem is in how the csv is read. Any recommendations?

I cannot reproduce the erroneous output. What yields print(reviews[3])? — Ronald, Jun 30 '20 at 15:30
Why are you escaping the backslash? Doesn't `rex = re.sub(r"\n+", " ", reviews[3])` work for you? — Toto, Jun 30 '20 at 15:33
@Toto It looks like his text has literal `\n` in it, not newline characters. — Barmar, Jun 30 '20 at 15:34
@Ronald, I've edited the code. fyi reviews[3] = "Got a letter in the mail last week that said Dr. Goldberg is moving to Arizona to take a new position there in June. He will be missed very much. \n\nI think finding a new doctor" — superpen, Jun 30 '20 at 21:00

score 1 · Accepted Answer · answered Jun 30 '20 at 15:37

Your text contains literal \n in it, not newlines.

The regexp \n matches a newline, not a literal \n (a backslash followed by the letter n). To match a literal \n, the regexp must be \\n. Escaping the backslash once in an ordinary Python string only lets a single backslash reach the regexp parser, which then reads \n as a newline. You need to double-escape it so that the regexp matches a literal \n, or use a raw string.

rex = re.sub(r"(\\n)+", " ", reviews[3])
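
To see the difference side by side, here is a minimal sketch. It assumes the review text holds a backslash followed by n (two characters) rather than a real newline; the variable names and the shortened sample string are only illustrative.

```python
import re

# Sample text with a literal backslash followed by "n" (two characters each),
# not real newline characters. Shortened from the review quoted in the comments.
text = "He will be missed very much.  \\n\\nI think finding a new doctor"

# "\\n+" reaches the regex engine as \n+, which matches real newlines only,
# so nothing in this text is replaced.
unchanged = re.sub("\\n+", " ", text)

# r"(\\n)+" matches one or more literal backslash-n pairs.
cleaned = re.sub(r"(\\n)+", " ", text)

print(unchanged == text)  # the original pattern leaves the text untouched
print(cleaned)
```

If the data might contain both real newlines and literal \n sequences, a pattern such as r"(\\n|\n)+" would cover both cases.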

See What exactly is a "raw string regex" and how can you use it?
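
As a rough illustration of what the raw string buys you, the sketch below compares the two spellings of the same pattern; the variable names are hypothetical.

```python
import re

# A raw string keeps backslashes as written; a regular string literal
# needs every backslash doubled to produce the same characters.
raw_pattern = r"\\n"
plain_pattern = "\\\\n"

# Both spellings produce the same three characters: backslash, backslash, n.
same = raw_pattern == plain_pattern
length = len(raw_pattern)

# Either one matches a literal backslash followed by "n" in the text.
match = re.search(raw_pattern, "much.  \\n\\nI think")
print(same, length, match.group())
```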
