2

Is there a way to replace all types of newline character in Python with "\n"? The most common newline characters seem to be "\n" and "\r", but Wikipedia lists several other representations. I am looking for something like:

For whitespaces (using re):

txt = re.sub(r'[\s]+',' ',txt)

For hyphens (using regex):

txt = regex.sub(r'\p{Pd}+', '-', txt)
DanielTheRocketMan
  • What is your question? Is it to normalize text or to replace newline characters only? – Vishnudev Krishnadas Dec 14 '18 at 19:22
  • Try `regex.sub(r'\R', '\n', s)` – Wiktor Stribiżew Dec 14 '18 at 19:24
  • Just to replace newline characters. I will remove the last line since it may be causing ambiguity. – DanielTheRocketMan Dec 14 '18 at 19:24
  • @WiktorStribiżew your answer is great and it works very well. However, although I have been using regex as above to normalize different types of hyphens, `regex.sub(r'\R', '\n', s)` didn't work for me in this case! – DanielTheRocketMan Dec 14 '18 at 19:51
  • This is almost a dupe https://stackoverflow.com/questions/4388630/unicode-regexp-to-match-line-breaks – revo Dec 14 '18 at 19:57
  • Two more https://stackoverflow.com/questions/3445326/regex-in-java-how-to-deal-with-newline/3445417#3445417 and https://stackoverflow.com/questions/40928114/simple-regex-working-on-windows-and-not-in-linux-using-java – revo Dec 14 '18 at 20:01
  • @revo Sorry. I was looking (maybe) for a simpler Python solution (using Python modules). In the end, the final solution is close to the Java one or the general one. – DanielTheRocketMan Dec 14 '18 at 20:04

2 Answers

4

There is a \R construct that you can use with the PyPI regex module. However, even with re, you can use its equivalent:

re.sub(r'\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]', '\n', s) 

See the Python demo.
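As a minimal sketch of how that pattern could be wrapped for reuse with only the standard library, assuming Python 3: the helper name `normalize_newlines` and the sample string are hypothetical, chosen for illustration. The alternation lists `\r\n` first so that a Windows line ending collapses to one `\n` rather than two.

```python
import re

# Same alternation as the pattern above: CRLF first, then the
# single-character line breaks (LF, VT, FF, CR, NEL, LS, PS).
_NEWLINES = re.compile(r'\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]')


def normalize_newlines(text: str) -> str:
    # Hypothetical helper: replace each line-break sequence with one LF.
    return _NEWLINES.sub('\n', text)


sample = "a\r\nb\rc\u2028d\x85e\n"
print(repr(normalize_newlines(sample)))
```

Precompiling the pattern is optional; it only avoids repeating the pattern string when the function is called from several places.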

Wiktor Stribiżew
1

To replace any \r (carriage return) with \n (newline):

txt = re.sub(r"\r", "\n", txt)

The r prefix before the opening quote makes the literal a raw string, so the backslash is passed to the regex engine unchanged.
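One caveat worth illustrating: replacing each `\r` on its own turns a Windows `\r\n` pair into `\n\n`. A small sketch of a variant that treats `\r\n` as a single unit follows; the function name `cr_to_lf` is hypothetical, and this only covers carriage returns, not the other Unicode line breaks.

```python
import re


def cr_to_lf(text: str) -> str:
    # "\r\n?" matches a CR plus an optional following LF, so a CRLF pair
    # becomes one "\n" instead of two.
    return re.sub(r"\r\n?", "\n", text)


print(repr(cr_to_lf("a\r\nb\rc\n")))
```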

Indent