I want to be able to delete all instances of newlines within <p> tags, but not the ones outside. Example:

<p dir="ltr">Test<br>\nA\naa</p>\n<p dir="ltr">Bbb</p>

This is the regex I came up with:

(<p[^>]*?>)(?:(.*)\n*)*(.*)(</p[^>]*?>)

and I replace with:

$1$2$3$4

I was hoping this would work, but (?:(.*)\n*)* seems to be causing issues. Is there any way to do repeated matches like this with a capturing group?

Thanks in advance!

Srini
thisbytes

1 Answer

Solution

You can use this regex. It works in PCRE but not in Java, because Java's regex engine does not support \K; for a Java version, see further down.

(?s)(?:<p|\G(?!\A))(?:(?!<\/p>).)*?\K[\n\r]+

Regex Demo

Regex Breakdown

(?s) #Let . match newlines as well

(?:
   <p #Start at an opening <p, so that whatever we find is inside a <p> tag
    | #Alternation (OR)
   \G(?!\A) #Or continue from where the previous match ended (but not at the start of the string)
)

(?:
  (?!<\/p>). #Match any character, as long as </p> does not begin at this position
)*? #Repeat lazily

\K #Discard everything matched so far from the reported match

[\n\r]+ #Match one or more \n or \r
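
To make the role of \G more concrete, here is a small Java sketch that lists each match the pattern produces. It is only an illustration: the class name GAnchorTrace and the helper trace are made up for this example, and because Java has no \K, the prefix is captured in group 1 instead of being discarded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class GAnchorTrace {
    // Same idea as the PCRE pattern, with a capturing group standing in for \K.
    static final Pattern NEWLINE_IN_P =
        Pattern.compile("(?s)((?:<p|\\G(?!\\A))(?:(?!</p>).)*?)[\\n\\r]+");

    // Returns one entry per match: "start-end: <text captured before the newline run>".
    static List<String> trace(String input) {
        List<String> steps = new ArrayList<>();
        Matcher m = NEWLINE_IN_P.matcher(input);
        while (m.find()) {
            steps.add(m.start() + "-" + m.end() + ": " + m.group(1));
        }
        return steps;
    }

    public static void main(String[] args) {
        // Two newlines inside the paragraph, one after it.
        for (String step : trace("<p>a\nb\nc</p>\nd")) {
            System.out.println(step);
        }
    }
}
```

The first match has to begin at <p; every later match can only begin where the previous one ended. Once the scan reaches </p>, the chain breaks, so the newline after the closing tag is never matched.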

Java Code

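With a small modification, I was able to do it in Java. Since \K is not available there, the text before the newlines is captured in group 1 and put back with $1: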

String line = "<p dir=\"ltr\">Test<br>\nA\naa</p>\nabcd\n<p dir=\"ltr\">Bbb</p>"; 
System.out.println(line.replaceAll("(?s)((?:<p|\\G(?!\\A))(?:(?!<\\/p>).)*?)[\\n\\r]+", "$1"));

Ideone Demo
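
If the single-regex approach feels hard to maintain, a different technique is to match each paragraph first and then strip the line breaks inside it in a second step. The sketch below is an assumption-laden alternative, not the method above: it needs Java 9 or later for Matcher.replaceAll with a function argument, the class and method names are invented for illustration, and it assumes <p> elements are not nested.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative only.
class StripNewlinesInParagraphs {
    // One <p ...>...</p> element; (?s) lets . span line breaks, .*? stops at the first </p>.
    private static final Pattern PARAGRAPH = Pattern.compile("(?s)<p\\b[^>]*>.*?</p>");

    static String stripNewlinesInsideP(String html) {
        // For every paragraph, remove \n and \r runs; text between paragraphs is left untouched.
        // quoteReplacement keeps any $ or \ in the paragraph from being read as replacement syntax.
        return PARAGRAPH.matcher(html).replaceAll(
            m -> Matcher.quoteReplacement(m.group().replaceAll("[\\n\\r]+", "")));
    }

    public static void main(String[] args) {
        String line = "<p dir=\"ltr\">Test<br>\nA\naa</p>\nabcd\n<p dir=\"ltr\">Bbb</p>";
        System.out.println(stripNewlinesInsideP(line));
    }
}
```

This trades one compact pattern for two simple ones, and the \b after <p keeps tags such as <pre> from being treated as paragraphs.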

rock321987