
I have the following stored inside $text:

<h1>Bonjour tout le monde (diverses langues) !</h1>

<h2>Anglais</h2>

Hello World!
<quote>Every first computer program starts out "Hello World!".</quote>

<h2>Espagnol</h2>

¡Hola mundo!

<image=http://example.com/IMG/jpg/person.jpg>

I want to wrap the paragraphs that are not already inside a tag in <p>...</p> tags.

I tried this:

$text =~ s/(?:<.*>)*(.*)/<p>$1<\/p>/g;

But the substitution does not keep the text matched by my non-capturing group. It produces this instead:

<p>

</p><p>

Hello World!
</p><p>

</p><p>

¡Hola mundo!

</p><p>
</p><p></p>

Any ideas?

Thanks.

Christopher Bottoms
Nooxx
  • [Here is what you are looking for](http://stackoverflow.com/a/1732454/801553) – Patrick J. S. Apr 24 '15 at 16:41
  • Is that your real data? It is neither XML nor HTML. – Borodin Apr 24 '15 at 16:53
  • You also have an angle bracket in `[ESADSE->www.esadse.fr/]` which is supposed to be plain text. If this is anything like XML, then that either needs to be replaced with an entity, as in `[ESADSE-&gt;www.esadse.fr/]`, or the section needs to be marked as CDATA. – Borodin Apr 24 '15 at 17:07
  • No, it's not XML. It is not a common data format. And I don't want to parse HTML with a regexp; I want to convert my specific format into HTML. – Nooxx Apr 24 '15 at 17:13

2 Answers


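s/// replaces everything it matched, including the part matched by a non-capturing group.

You can capture that part as well and put it back in the replacement: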

$text =~ s/((?:<.*>)*)(.*)/$1<p>$2<\/p>/g;

Text matched by a look-ahead or a look-behind is not considered part of the match. Neither is the text matched before a \K is encountered.

$text =~ s/(?:<.*>)*\K(.*)/<p>$1<\/p>/g;

The second solution requires Perl 5.10+.
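
To make the difference concrete, here is a minimal sketch that runs all three substitutions on a single line. The sample string and the variable names `$lost`, `$kept` and `$kept_k` are made up for illustration. The `/g` flag is left out on purpose, since with `/g` the pattern can also match the empty string after the text and add an empty `<p></p>` pair.

```perl
use strict;
use warnings;

# Hypothetical one-line sample: a tag followed by untagged text.
my $sample = '<h2>Anglais</h2>Hello World!';

# 1. Original attempt: the tag is part of the match, so it is replaced too.
(my $lost = $sample) =~ s/(?:<.*>)*(.*)/<p>$1<\/p>/;

# 2. Capture the tags and re-emit them in the replacement.
(my $kept = $sample) =~ s/((?:<.*>)*)(.*)/$1<p>$2<\/p>/;

# 3. \K (Perl 5.10+) excludes everything matched so far from the replaced text.
(my $kept_k = $sample) =~ s/(?:<.*>)*\K(.*)/<p>$1<\/p>/;

print "$lost\n";    # <p>Hello World!</p>
print "$kept\n";    # <h2>Anglais</h2><p>Hello World!</p>
print "$kept_k\n";  # <h2>Anglais</h2><p>Hello World!</p>
```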

ikegami

Perhaps try a pattern that only matches lines that don't start or end with < >. Including \n in the negated character classes is also recommended, so that lines containing only a line feed don't get <p></p> tags:

$text =~ s/(^[^<\n]+.+|.+[^\/\n>]+$)/<p>$1<\/p>/gm;

Example:

http://ideone.com/p55Ino
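
As a rough illustration of the same line-based idea, the sketch below uses a simpler pattern than the one above: a negative look-ahead that skips any line beginning with `<`. This is a simplified variant, not the exact pattern from this answer, and it assumes every tagged element sits on its own line with no leading whitespace. The sample is a shortened copy of the text from the question.

```perl
use strict;
use warnings;
use utf8;
binmode(STDOUT, ':encoding(UTF-8)');

# Shortened copy of the sample text from the question.
my $text = <<'END';
<h2>Anglais</h2>

Hello World!
<quote>Every first computer program starts out "Hello World!".</quote>

<h2>Espagnol</h2>

¡Hola mundo!
END

# (?!<) rejects lines that begin with "<"; .+ needs at least one
# character, so blank lines are skipped. With /m, ^ and $ match at
# every line boundary.
(my $out = $text) =~ s/^(?!<)(.+)$/<p>$1<\/p>/gm;

print $out;
```

Only the two untagged lines should end up wrapped. A line of plain text that happens to start with `<` would be skipped, so this carries the same single-line limitation as the pattern above.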

l'L'l
  • Works according to my tests. Good job! – Christopher Bottoms Apr 24 '15 at 17:48
  • I updated the pattern slightly, so it should now match text that might have `< >` within it (e.g. `Hello > World`). If you want something more general, then you could use `^([^\n<>]+)$/<p>$1<\/p>/gm;` – l'L'l Apr 24 '15 at 17:50
  • @Nooxx: Keep in mind that this solution is very limited. All elements must be contained within a single line. If that works for you, then great. Otherwise, you may not want to be using regex for it, especially if elements may be nested. – Brian Stephens Apr 24 '15 at 17:58
  • Yes, I'll agree. In light of that, I've further updated the pattern to handle some limited nested occurrences containing `< >` (eg. `<¡Hola mundo!>Nested Text<--`) – l'L'l Apr 24 '15 at 18:06