
I wrote some code that replaces all the whitespace in a .txt file (basically a novel) with newlines. This put each word on its own line, but it also left some empty lines, which I want to remove. So I am trying to remove all whitespace except newlines. How might I do that using regex?

First I did this to replace whitespace with newlines:

text= re.sub(r'\s',r'\n',text)

Then I tried this to remove the empty lines, but it does not do the job:

text= re.sub(r'(\s)(^\n)',"",text)
Wiktor Stribiżew
Nakkhatra

1 Answer


You may use:

text = re.sub(r'[^\S\r\n]+', '', text)

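The regex pattern [^\S\r\n]+ matches one or more whitespace characters other than line feeds and carriage returns. The negated class excludes \S (any non-whitespace character), which leaves only whitespace, and then also excludes \r and \n.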
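
As a minimal sketch of what this pattern does and does not remove, the snippet below applies it to a short sample string. The function name `strip_inline_whitespace` and the sample text are made up for illustration. Note that the pattern leaves every line break in place, so consecutive newlines (empty lines) survive.

```python
import re

def strip_inline_whitespace(text):
    # Hypothetical helper name, used only for this illustration.
    # [^\S\r\n] = whitespace that is neither a carriage return nor a line feed.
    return re.sub(r'[^\S\r\n]+', '', text)

sample = "the\n tragedy\n of\n Mcbeth\n \n Actus\n Primus"
cleaned = strip_inline_whitespace(sample)

# The spaces are gone, but the two newlines between "Mcbeth" and "Actus" remain.
print(repr(cleaned))
```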

Tim Biegeleisen
  • That should remove whitespace except newlines, but I am still getting those empty lines. My aim is to remove the empty lines. – Nakkhatra May 22 '21 at 15:36
  • You could try `re.sub(r'\n{2,}', '\n', text)`, which replaces each sequence of two or more newlines with a single newline. – Tim Biegeleisen May 22 '21 at 15:48
  • Hi, actually there's a problem with the method you mentioned. I have a string like this: "the\n tragedy\n of\n Mcbeth\n \n Actus\n Primus" Doing it that way leaves: "the\n tragedy\n of\n Mcbeth Actus\n Primus" This puts Mcbeth and Actus on the same line, but I need them on separate lines, with only the empty line between them removed. – Nakkhatra May 22 '21 at 18:55
  • In other words, I want to keep just one \n wherever there are two or more in a row. – Nakkhatra May 22 '21 at 18:59
  • Sorry for all these comments. I got it working by replacing r'[\n]{2,}' with '\n'. – Nakkhatra May 22 '21 at 19:01
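
Putting the thread together, here is a hedged sketch of the two steps discussed above (remove non-newline whitespace, then collapse repeated newlines). It also shows a one-pass variant that is not from the thread: replacing each run of whitespace (`\s+` rather than `\s`) with a single newline, so empty lines never appear in the first place. The function names are hypothetical.

```python
import re

def two_step(text):
    # Step 1 (from the answer): drop whitespace other than \r and \n.
    text = re.sub(r'[^\S\r\n]+', '', text)
    # Step 2 (from the comments): collapse two or more newlines into one.
    return re.sub(r'\n{2,}', '\n', text)

def one_word_per_line(text):
    # Alternative, not from the thread: treat each whitespace run as one
    # separator, then trim any leading or trailing newline.
    return re.sub(r'\s+', '\n', text).strip()

sample = "the\n tragedy\n of\n Mcbeth\n \n Actus\n Primus"
print(two_step(sample))
print(one_word_per_line(sample))
```

Both calls print each word on its own line with no empty line between "Mcbeth" and "Actus". Note that Python's `re` module does not accept a space inside a quantifier, so `{2, }` would be matched as literal text; the quantifier must be written `{2,}`.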