Perl Substitution regexp with non capturing group

Question

I have the following stored inside $text:

<h1>Bonjour tout le monde (diverses langues) !</h1>

<h2>Anglais</h2>

Hello World!
<quote>Every first computer program starts out "Hello World!".</quote>

<h2>Espagnol</h2>

¡Hola mundo!

<image=http://example.com/IMG/jpg/person.jpg>

And I want to insert some

<p>...</p>

tags around the paragraphs that are not already in a tag.

I tried this:

$text =~ s/(?:<.*>)*(.*)/<p>$1<\/p>/g;

But the substitution does not keep my non-capturing groups. It produces this instead:

<p>

</p><p>

Hello World!
</p><p>

</p><p>

¡Hola mundo!

</p><p>
</p><p></p>

Any ideas?

Thanks.

[Here is what you are looking for](http://stackoverflow.com/a/1732454/801553) – Patrick J. S., Apr 24 '15 at 16:41
You also have an angle bracket in `[ESADSE->www.esadse.fr/]` which is supposed to be plain text. If this is anything like XML, then it either needs to be replaced with an entity, as in `[ESADSE-&gt;www.esadse.fr/]`, or the section needs to be marked as CDATA – Borodin, Apr 24 '15 at 17:07
No, it's not XML. It is not a common data format. And I don't want to parse HTML with a regexp; I want to convert my specific format to HTML. – Nooxx, Apr 24 '15 at 17:13

score 1 · Answer 1 · answered Apr 24 '15 at 18:20

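s/// replaces everything it matched, including the text matched by a non-capturing group.

You can capture that text and put it back in the replacement: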

$text =~ s/((?:<.*>)*)(.*)/$1<p>$2<\/p>/g;

Text matched by a look-ahead or a look-behind is not considered part of the match. Neither is the text matched before a \K is encountered.

$text =~ s/(?:<.*>)*\K(.*)/<p>$1<\/p>/g;

The second solution requires Perl 5.10+.
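
To see the difference between the three substitutions side by side, here is a minimal sketch. The one-line sample string and the variable names are made up for illustration, and the /g flag is left out so that each substitution makes exactly one match.

```perl
use strict;
use warnings;

# Hypothetical one-line sample: a tag followed by untagged text.
my $line = '<h2>Anglais</h2>Hello World!';

# 1. Original attempt: the non-capturing group is still part of the
#    match, so the tag is replaced along with everything else.
(my $lost = $line) =~ s/(?:<.*>)*(.*)/<p>$1<\/p>/;

# 2. Capture the leading tag and put it back in the replacement.
(my $captured = $line) =~ s/((?:<.*>)*)(.*)/$1<p>$2<\/p>/;

# 3. \K (Perl 5.10+): text matched before \K is not part of the match,
#    so it is left untouched.
(my $kept = $line) =~ s/(?:<.*>)*\K(.*)/<p>$1<\/p>/;

print "$lost\n";      # <p>Hello World!</p>
print "$captured\n";  # <h2>Anglais</h2><p>Hello World!</p>
print "$kept\n";      # <h2>Anglais</h2><p>Hello World!</p>
```

The second and third forms give the same result here; \K just avoids the extra capture group.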

ikegami


Very impressive, the `\K` solution. – Marcus May 21 '20 at 09:44
@Marcus, I love `\K`. "Insert after match" is such a common operation, and `\K` makes it simple. – ikegami May 21 '20 at 10:02

l'L'l · Accepted Answer · 2015-04-24T18:05:35.180

0

Perhaps try a pattern that only matches lines that don't start with < or don't end with >. Excluding \n in the character classes is also recommended, since you wouldn't want every line containing only a line feed to get <p></p> tags:

$text =~ s/(^[^<\n]+.+|.+[^\/\n>]+$)/<p>$1<\/p>/gm;

Example:

http://ideone.com/p55Ino
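
In case the link goes stale, here is a small self-contained sketch of the same substitution run on a shortened copy of the text from the question. The heredoc and the counting at the end are only there for illustration.

```perl
use strict;
use warnings;
use utf8;                              # the sample contains "¡"
binmode STDOUT, ':encoding(UTF-8)';

# Shortened copy of the sample text from the question.
my $text = <<'END';
<h2>Anglais</h2>

Hello World!
<quote>Every first computer program starts out "Hello World!".</quote>

<h2>Espagnol</h2>

¡Hola mundo!

<image=http://example.com/IMG/jpg/person.jpg>
END

# /m makes ^ and $ match at every line boundary; /g applies the
# substitution to every matching line. Blank lines and lines that
# both start with < and end with > are left alone.
$text =~ s/(^[^<\n]+.+|.+[^\/\n>]+$)/<p>$1<\/p>/gm;

# Count how many paragraphs were wrapped (list-context match count).
my $wrapped = () = $text =~ /<p>/g;

print $text;
print "wrapped lines: $wrapped\n";
```

On this sample only the "Hello World!" and "¡Hola mundo!" lines get wrapped. Note that each alternative needs at least two characters on the line, so a one-character paragraph would not be wrapped.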

edited Apr 24 '15 at 18:05

answered Apr 24 '15 at 17:43


Works according to my tests. Good job! – Christopher Bottoms Apr 24 '15 at 17:48
I updated the pattern slightly, which should match text that might have `< >` within it... (eg. `Hello > World`). If you want something more general then you could use `s/^([^\n<>]+)$/<p>$1<\/p>/gm;` – l'L'l Apr 24 '15 at 17:50
@Nooxx: Keep in mind that this solution is very limited. All elements must be contained within a single line. If that works for you, then great. Otherwise, you may not want to be using regex for it, especially if elements may be nested. – Brian Stephens Apr 24 '15 at 17:58
Yes, I'll agree. In light of that, I've further updated the pattern to handle some limited nested occurrences containing `< >` (eg. `<¡Hola mundo!>Nested Text<--`) – l'L'l Apr 24 '15 at 18:06
