Regex capturing within a group

Question

I want to be able to delete all instances of newlines within <p> tags, but not the ones outside. Example:

<p dir="ltr">Test<br>\nA\naa</p>\n<p dir="ltr">Bbb</p>

This is the regex I came up with:

(<p[^>]*?>)(?:(.*)\n*)*(.*)(</p[^>]*?>)

and I replace with:

$1$2$3$4

I was hoping that this would work but (?:(.*)\n*)* seems to be causing issues. Is there any way to do repeated matches like this, with a capturing group?

Thanks in advance!
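For context on why `(?:(.*)\n*)*` does not behave as hoped: in Java, as in most regex flavors, a capturing group inside a repeated group keeps only the text from its last iteration, so `$2` cannot carry every line. A minimal sketch of that behavior follows; the class name, the simplified pattern, and the sample input are made up for illustration and are not the pattern from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and sample input are illustrative only.
class RepeatedGroupDemo {

    // Returns what group 1 holds once the whole repetition has matched.
    static String lastCapture(String input) {
        // Same shape as the question's pattern: a capturing group inside a repeated group.
        Matcher m = Pattern.compile("(?:(\\w+)\\n*)+").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Group 1 matched "one", then "two", then "three"; only the last one survives.
        System.out.println(lastCapture("one\ntwo\nthree")); // prints "three"
    }
}
```

The earlier iterations are overwritten, which is why a single replacement with `$1$2$3$4` loses text.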

There are two `p` tags? You want `\n` to be removed separately for them? – rock321987, May 23 '16 at 18:12
Separately for `p` tags is fine. It's just that I'm hoping to replace all the `\n` within the `p` tags in one fell swoop. I was hoping that it's possible with regex without nested loops. – thisbytes, May 23 '16 at 18:16
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – ThePerson, May 23 '16 at 18:24
I would recommend using something like JSoup for this kind of work. – ThePerson, May 23 '16 at 18:25

rock321987 · Accepted Answer · 2016-05-24T09:13:10.850

Solution

You can use this regex. It works in PCRE but not in Java, which does not support \K; the Java version is further below.

(?s)(?:<p|\G(?!\A))(?:(?!<\/p>).)*?\K[\n\r]+

Regex Demo

Regex Breakdown

(?s)            # Let . match newlines as well

(?:
   <p           # Start at an opening <p tag, so every match lies inside a paragraph
    |           # Alternation (OR)
   \G(?!\A)     # Or continue where the previous match ended (but not at the very start of the input)
)

(?:
  (?!<\/p>).    # Match any character, as long as </p> does not start here
)*?             # Repeat lazily

\K              # Discard everything matched so far from the reported match

[\n\r]+         # Match one or more \n or \r characters
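The `\G(?!\A)` part is what chains the matches together, so it may help to see it on a smaller input. The sketch below is only an illustration: the class name, the `#` marker, and the sample string are invented, and it is written in Java (which supports `\G` but not `\K`). The pattern matches a digit either right after a `#` or directly after the previous match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and inputs are illustrative only.
class GAnchorDemo {

    // Collects every match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // "#" plays the role of "<p": a chain of matches can only start there.
        // \G(?!\A) lets the next match begin exactly where the previous one ended.
        System.out.println(findAll("(?:#|\\G(?!\\A))\\d", "12#345x6")); // prints [#3, 4, 5]
    }
}
```

The digits before `#` are skipped because no chain has started yet, and the `6` after `x` is skipped because the chain was broken, which is the same reason newlines outside `<p>...</p>` are left alone.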

Java Code

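With a bit of modification, I was able to do it in Java. Since Java has no \K, the text before the newlines is captured in group 1 and written back with $1: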

String line = "<p dir=\"ltr\">Test<br>\nA\naa</p>\nabcd\n<p dir=\"ltr\">Bbb</p>"; 
System.out.println(line.replaceAll("(?s)((?:<p|\\G(?!\\A))(?:(?!<\\/p>).)*?)[\\n\\r]+", "$1"));

Ideone Demo
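For anyone who wants to run this outside an online demo, here is a self-contained sketch of the same idea. The class and method names are hypothetical. One deliberate difference from the answer, which is my own assumption rather than part of it: a `\b` after `<p` so that tags such as `<pre>` are not treated as paragraphs.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper class; not part of the original answer.
class ParagraphNewlines {

    // Group 1 stands in for PCRE's \K: it holds the text before the newlines.
    // Assumption: \b after <p is an added tightening so <pre> does not start a chain.
    private static final Pattern NEWLINES_IN_P = Pattern.compile(
            "(?s)((?:<p\\b|\\G(?!\\A))(?:(?!</p>).)*?)[\\n\\r]+");

    // Removes \n and \r that occur between <p ...> and the next </p>.
    static String strip(String html) {
        return NEWLINES_IN_P.matcher(html).replaceAll("$1");
    }

    public static void main(String[] args) {
        String line = "<p dir=\"ltr\">Test<br>\nA\naa</p>\nabcd\n<p dir=\"ltr\">Bbb</p>";
        // Newlines inside the first paragraph go away; the ones around "abcd" stay.
        System.out.println(strip(line));
    }
}
```

As the comments on the question point out, a regex like this is fragile on real-world HTML (nested or unclosed tags, for instance), so treat it as a fit for simple, well-formed input only.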

edited May 24 '16 at 09:13

answered May 23 '16 at 18:28

rock321987


Holy... Wow. that's pretty darn amazing. – thisbytes May 23 '16 at 18:31
damn my regex noobness! good job rock - i was too slow to be the savior. – zec May 23 '16 at 18:32
@Jun first let me check it in JAVA – rock321987 May 23 '16 at 18:32
Did you know that you can put [comments in regex101.com](https://regex101.com/r/nA3wS1/2) as well? Additionally, why put the `\n` in a character class? It's only one character! +1 nevertheless. – Jan May 23 '16 at 18:35
I really need to try to understand your regex now. It's going to be hard :) Thanks! – thisbytes May 23 '16 at 18:45
@Jun added a bit of explanation. It's only a starter; you have to read it in detail for full understanding – rock321987 May 23 '16 at 18:56
@downvoter..please leave a comment before downvoting..this is ridiculous and frustrating – rock321987 May 23 '16 at 19:00
wasn't me, but my best guess is that it has something to do with "works in PCRE but not Java" ? – Scott Weaver May 23 '16 at 19:13
@sweaver2112 whoever it was should have read at least the full answer – rock321987 May 23 '16 at 19:15
why not just remove the PCRE stuff? Also, can you explain why your Java solution is "a bit of a hack" ? – Scott Weaver May 23 '16 at 19:16
Just was about to add answer `"(?s)\\n+(?=(?:(?!
– bobble bubble May 23 '16 at 19:18
@sweaver2112 I won't remove the PCRE stuff because it guided me to the answer. Well, it's not really a hack, but if Java supported `\K`, it would have been much easier – rock321987 May 23 '16 at 19:20
If you want to integrate in answer, I feel honored, else just leave as comment. I go for bed now (: – bobble bubble May 23 '16 at 19:23
@bobblebubble in any case, you should add it..as it is better than mine..:) – rock321987 May 23 '16 at 19:26
@rock321987 Your pattern is more accurate and if there is a long html input with many `\n` outside of `
– bobble bubble May 23 '16 at 19:35
@bobblebubble but there is one problem..your pattern does not consider to modify within `p` tag..see **[here](https://regex101.com/r/nA3wS1/4)**..i am removing your update..please make it work? – rock321987 May 23 '16 at 19:45
@rock321987 You mean it does not check for an opening `
– bobble bubble May 23 '16 at 20:18
