I have some broken HTML that I would like to fix with a regex.

The html might be something like this:

<p>text1</p>
<p>text2</p>
text3
<p>text4</p>
<p>text5</p>

But there can be many more paragraphs, and other HTML elements too.

I want to turn it into:

<p>text1</p>
<p>text2</p>
<p>text3</p>
<p>text4</p>
<p>text5</p>

Is this possible with a regex? I'm using PHP, if that matters.

Martin
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – annakata Aug 12 '10 at 13:05
  • Duplicate of approximately infinite questions. Regex.Parse(HTML) = FAIL – annakata Aug 12 '10 at 13:06

3 Answers

No, this is generally a bad idea with regexes. Regexes don't do stateful parsing, and HTML has implicit tags, so parsing it requires keeping track of state.

HTML generally has lots of quirks. It is hard to write an HTML parser, because not only do you have to keep track of how things should be, you also have to account for the broken markup seen in the wild.

Regexes are the wrong tool for this job.
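
For comparison, here is a rough sketch of the parser-based route, assuming PHP's bundled DOM extension is available. The helper name `wrapBareText` and the wrapper `<div>` are made up for illustration, and the sketch only handles stray text at the top level of the fragment, so treat it as a starting point rather than a complete fix:

```php
<?php
// Hypothetical helper (not from the thread): wraps stray top-level text in <p>.
function wrapBareText(string $html): string
{
    $doc = new DOMDocument();

    // Real-world HTML often triggers libxml warnings; collect them silently.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a <div> so its direct children are easy to find.
    // Note: non-ASCII input would need extra encoding handling here.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $doc->getElementsByTagName('div')->item(0);

    // Copy the child list first, because we modify the tree while looping.
    foreach (iterator_to_array($root->childNodes) as $node) {
        if ($node instanceof DOMText && trim($node->nodeValue) !== '') {
            $p = $doc->createElement('p');
            $p->appendChild($doc->createTextNode(trim($node->nodeValue)));
            $root->replaceChild($p, $node);
        }
    }

    // Serialize only the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo wrapBareText("<p>text1</p>\n<p>text2</p>\ntext3\n<p>text4</p>\n<p>text5</p>");
```

Whitespace between the elements may not come out exactly as it went in, since the newlines around the stray text are trimmed when it is wrapped.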

szbalint

Could http://htmlpurifier.org/ help you?

Knarf
  • Ah, it would probably have been a bit overkill since I only need to solve this specific problem, but I will use HTML Purifier another time :) – Martin Aug 12 '10 at 13:32

While regexes are not the best tool for this kind of job, this code works for the example you gave (it might not be optimal!):

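<?php

// The broken input: "text3" is missing its <p> wrapper.
$text = '<p>text1</p>
<p>text2</p>
text3
<p>text4</p>
<p>text5</p>';

// Group 1: the run of <p> blocks before the stray text.
// Group 3: the stray text itself.
// Group 4: the run of <p> blocks after it.
// Groups 2 and 5 are the inner repeated groups and are not used.
$regex = '|(([\r\n ]*<p>[a-zA-Z0-9 \r\n]+</p>[\r\n ]*)+)([\r\n ]*[a-zA-Z0-9 ]+)(([\r\n ]*<p>[a-zA-Z0-9 \r\n]+</p>[\r\n ]*)+)|i';
$replacement = '${1}<p>${3}</p>${4}';
$replacedText = preg_replace($regex, $replacement, $text);

echo $replacedText;
?>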

In the replacement string, note that groups 1, 3 and 4 are used to get the correct sub-matches (groups 2 and 5 are the inner repeated groups). If you want to be able to match other HTML tags as well, you can use this regex:

$regex = '|(([\r\n ]*<[a-z0-6]+>[a-zA-Z0-9 \r\n]+</[a-z0-6]+>[\r\n ]*)+)([\r\n ]*[a-zA-Z0-9 ]+)(([\r\n ]*<[a-z0-6]+>[a-zA-Z0-9 \r\n]+</[a-z0-6]+>[\r\n ]*)+)|i';

But be aware that it can mess things up, because nothing ties the closing tag to the opening tag, so it can match a different element.
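
One possible way to reduce that particular risk, offered as a sketch and not as part of the original answer, is to capture the tag name and require the closing tag to repeat it with a backreference. The function name `wrapStray` is hypothetical, the delimiter is switched to `~`, and the inner groups are renumbered so that group 2 and group 5 hold the tag names:

```php
<?php
// Hypothetical variant: \2 and \5 are backreferences that force each closing
// tag to use the same name as the opening tag captured just before it.
function wrapStray(string $text): string
{
    $regex = '~((?:[\r\n ]*<([a-z0-6]+)>[a-zA-Z0-9 \r\n]+</\2>[\r\n ]*)+)'
           . '([\r\n ]*[a-zA-Z0-9 ]+)'
           . '((?:[\r\n ]*<([a-z0-6]+)>[a-zA-Z0-9 \r\n]+</\5>[\r\n ]*)+)~i';

    // Groups 1, 3 and 4 keep the same roles as in the regexes above.
    return preg_replace($regex, '${1}<p>${3}</p>${4}', $text);
}

// Properly paired tags: the stray line gets wrapped.
echo wrapStray("<h1>title</h1>\nstray\n<p>body</p>"), "\n";

// Mismatched tags: nothing matches, so the input comes back unchanged.
echo wrapStray("<b>one</i>\nstray\n<i>two</b>"), "\n";
```

This still inherits every other limitation of the approach (no attributes, no nesting, only letters, digits and spaces as content), so it narrows the problem without solving it.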

Christophe