regex: put text outside <p> inside <p>

Question

I have some broken HTML that I would like to fix with a regex.

The html might be something like this:

<p>text1</p>
<p>text2</p>
text3
<p>text4</p>
<p>text5</p>

But there can be many more paragraphs, and other HTML elements too.

I want to turn it into:

<p>text1</p>
<p>text2</p>
<p>text3</p>
<p>text4</p>
<p>text5</p>

Is this possible with a regex? I'm using PHP, if that matters.

possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — annakata, Aug 12 '10 at 13:05
Duplicate of approximately infinite questions. Regex.Parse(HTML) = FAIL — annakata, Aug 12 '10 at 13:06

score 3 · Accepted Answer · answered Aug 12 '10 at 12:36

No, this is generally a bad idea. Regexes don't do stateful parsing, and HTML, with its implicit tags, can only be parsed by keeping track of state.

HTML also has lots of quirks. Writing an HTML parser is hard because you have to track not only how the markup should look, but also the broken markup seen in the wild.

Regexes are the wrong tool for this job.
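
A minimal sketch of the parser-based route, assuming PHP's bundled DOM extension is available. The helper name `wrapLooseText` is hypothetical, and the sketch makes simplifying assumptions: the input is plain ASCII, it contains no stray `</div>`, and every top-level element is treated as a block, so bare text next to inline tags such as `<b>` would be split into separate paragraphs.

```php
<?php
// Hypothetical helper (not from the answer): wraps bare top-level text in <p>.
function wrapLooseText(string $html): string
{
    $doc = new DOMDocument();

    // Broken markup triggers libxml warnings; collect them instead of printing.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a <div> so its top-level nodes are easy to find again.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->getElementsByTagName('div')->item(0);

    $out = [];
    foreach ($root->childNodes as $node) {
        if ($node->nodeType === XML_TEXT_NODE) {
            $text = trim($node->nodeValue);
            if ($text === '') {
                continue; // only whitespace between two elements
            }
            // nodeValue is decoded text, so escape it again before wrapping.
            $out[] = '<p>' . htmlspecialchars($text, ENT_NOQUOTES) . '</p>';
        } else {
            // Elements and comments are serialized as the parser saw them.
            $out[] = $doc->saveHTML($node);
        }
    }

    // One top-level node per line.
    return implode("\n", $out);
}

$fixed = wrapLooseText("<p>text1</p>\n<p>text2</p>\ntext3\n<p>text4</p>\n<p>text5</p>");
echo $fixed, "\n";
```

The parser decides where each element starts and ends, so the code only has to look at node types and never has to guess at tag boundaries.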

szbalint

I see. I wrote a parser for it instead, and it works well. Thanks :) – Martin Aug 12 '10 at 13:31

score 1 · Answer 2 · answered Aug 12 '10 at 12:26

Could http://htmlpurifier.org/ help you?

Knarf

Ah, it would probably have been a bit overkill since I only need to solve this specific problem, but I will use HTML Purifier another time :) – Martin Aug 12 '10 at 13:32

score 1 · Answer 3 · answered Aug 12 '10 at 13:03

While regexes are not the best solution for this kind of job, this code works for the example you gave (it might not be optimal!):

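<?php

// Sample input: "text3" is the only line missing its <p> wrapper.
$text = '<p>text1</p>
<p>text2</p>
text3
<p>text4</p>
<p>text5</p>';

// Group 1: the run of <p> blocks before the bare text.
// Group 3: the bare text itself.
// Group 4: the run of <p> blocks after it.
// Groups 2 and 5 are the inner repeated groups and are not used.
$regex = '|(([\r\n ]*<p>[a-zA-Z0-9 \r\n]+</p>[\r\n ]*)+)([\r\n ]*[a-zA-Z0-9 ]+)(([\r\n ]*<p>[a-zA-Z0-9 \r\n]+</p>[\r\n ]*)+)|i';
$replacement = '${1}<p>${3}</p>${4}';
$replacedText = preg_replace($regex, $replacement, $text);

echo $replacedText;
?>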

Note that the replacement string uses groups 1, 3 and 4. Groups 2 and 5 are the inner repeated groups nested inside groups 1 and 4, so they hold only the last <p> block of each run. If you want to match other HTML tags as well, you can use this regex:

$regex = '|(([\r\n ]*<[a-z0-6]+>[a-zA-Z0-9 \r\n]+</[a-z0-6]+>[\r\n ]*)+)([\r\n ]*[a-zA-Z0-9 ]+)(([\r\n ]*<[a-z0-6]+>[a-zA-Z0-9 \r\n]+</[a-z0-6]+>[\r\n ]*)+)|i';

But be aware that it can produce wrong results: the pattern does not require a closing tag to have the same name as its opening tag, so a mismatched pair such as <p>text</h1> is accepted too.
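
The mismatch problem can be reproduced with the tag part of that pattern on its own. The following is only an illustrative sketch: the `$strict` variant, which captures the tag name and reuses it through the backreference `\1`, is an assumption added here and is not part of the regex above.

```php
<?php
// Tag pattern as used in the generalized regex: any opening tag, any closing tag.
$loose = '|<[a-z0-6]+>[a-zA-Z0-9 \r\n]+</[a-z0-6]+>|i';

// Assumed variant: capture the tag name and require the same name when closing.
$strict = '|<([a-z0-6]+)>[a-zA-Z0-9 \r\n]+</\1>|i';

$mismatched = '<p>text</h1>';
$matched    = '<h2>text</h2>';

$looseOnMismatch  = preg_match($loose, $mismatched);  // 1: accepted although the tags differ
$strictOnMismatch = preg_match($strict, $mismatched); // 0: rejected
$strictOnMatch    = preg_match($strict, $matched);    // 1: accepted

var_dump($looseOnMismatch, $strictOnMismatch, $strictOnMatch);
```

Adding such a capture group to the full regex shifts the group numbers, so the `${1}`, `${3}` and `${4}` references in the replacement string would have to be renumbered. It still would not handle nested tags or attributes.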

Thanks, I followed the advice not to use regex for this, but thanks a lot anyway! — Martin, Aug 12 '10 at 13:33
