I import data from csv and read it with pandas:

import pandas as pd

train = pd.read_csv('yelp_review_full_csv/train.csv',
                    header=None,
                    names=['Class', 'Review'])

reviews = train['Review'] 

and I want to get rid of the newline symbols (\n) using a regex:

import re

print(reviews[3])
rex = re.sub("\\n+", " ", reviews[3])
print(rex)

which prints the same text twice, unchanged:

... much.  \n\nI think ...
... much.  \n\nI think ...

If I copy the output and test it against the regex separately, I get the desired result. I suspect the problem is in how the CSV is read. Any recommendations?

superpen
  • What is `reviews` and how does it relate to `train`? – Barmar Jun 30 '20 at 15:25
  • I cannot reproduce the erroneous output. What yields print(reviews[3])? – Ronald Jun 30 '20 at 15:30
  • Why are you escaping the backslash? Doesn't `rex = re.sub(r"\n+", " ", reviews[3])` work for you? – Toto Jun 30 '20 at 15:33
  • @Toto It looks like his text has literal `\n` in it, not newline characters. – Barmar Jun 30 '20 at 15:34
  • @Barmar, sorry for that, I've edited the code – superpen Jun 30 '20 at 20:53
  • @Ronald, I've edited the code. fyi reviews[3] = "Got a letter in the mail last week that said Dr. Goldberg is moving to Arizona to take a new position there in June. He will be missed very much. \n\nI think finding a new doctor" – superpen Jun 30 '20 at 21:00

1 Answer

Your text contains a literal \n (a backslash followed by the letter n), not newline characters.

In a regexp, \n matches a newline character, not the two-character sequence \n. To match a literal backslash followed by n, the regexp itself must be \\n. In a regular Python string, writing "\\n" only escapes the backslash for Python, so the regexp parser still receives \n. You need to double-escape it ("\\\\n") so that the regexp parser receives \\n, or use a raw string:

rex = re.sub(r"(\\n)+", " ", reviews[3])
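
To make the difference visible, here is a minimal sketch. The variables `literal` and `real` are hypothetical stand-ins for `reviews[3]` (shortened from the text quoted in the comments), not the actual CSV data; `repr()` is used to show which characters each string really contains.

```python
import re

# Hypothetical stand-in for reviews[3]: the raw string keeps a literal
# backslash followed by "n", which is what the CSV text appears to contain.
literal = r"missed very much.  \n\nI think finding a new doctor"

# The same sentence with two real newline characters instead.
real = "missed very much.  \n\nI think finding a new doctor"

# repr() exposes the difference: literal backslashes are shown doubled.
print(repr(literal))
print(repr(real))

# Pattern from the question: the Python string "\\n+" is the regexp \n+,
# which matches newline characters only, so `literal` is left untouched.
newline_only = re.sub("\\n+", " ", literal)
print(newline_only == literal)

# Pattern from this answer: the raw string r"(\\n)+" is the regexp (\\n)+,
# which matches one or more backslash-n pairs.
fixed = re.sub(r"(\\n)+", " ", literal)
print(repr(fixed))
```

If the first `repr()` line shows `\\n` for your own data, the text holds literal backslashes and the raw-string pattern is the one to use.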

See What exactly is a "raw string regex" and how can you use it?

Barmar