Possible Duplicates:
Preg match text in php between html tags
RegEx match open tags except XHTML self-contained tags

I have a large amount of text formatted in the following way:

    <P><B>1- TITLE</B>
    <P>
    <DL><DD>&nbsp;&nbsp;&nbsp; Text text text text text
text text
    </DL><P>
    <P><B>2 - Title 2</B>
    <P>
    <DL><DD>&nbsp;&nbsp;&nbsp; Text text text text text
text text Text text text text text
text text Text text text text text
text text
    <br><I>Additional irrelevant information</I>
    </DL><P>

I'm trying to use PHP's Regexp functions to retrieve the Title-Text value pairs while stripping out the extra &nbsp; characters as well as the irrelevant info that follows some of the text blocks. Preferably I'd like to:

Grab everything between <P><B> and </B> as the title

Grab all the text between <DL><DD>&nbsp;&nbsp;&nbsp; and the next HTML tag (<) as the text, and somehow keep the two associated together for further processing.

Any idea how to do this with PHP's Regexp functions?

MarathonStudios
  • \*sigh\* I wonder if these questions are ever going to stop. – Tomalak Feb 19 '11 at 09:00
  • @Tomalak, you wish! Just find a similar question, and vote to close. Preferably before someone comes along and either posts a link to **the** XHTML-regex answer, or the "blah regex blah 2 problems"-quote. – Bart Kiers Feb 19 '11 at 09:11
  • possible duplicate of ["RegEx match open tags except XHTML self-contained tags"](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags), ["Need a regular expression to parse HTML tags"](http://stackoverflow.com/questions/3577591/need-a-regular-expression-to-parse-html-tags), ["Regular Expression in PHP: How to create a pattern for tables in html"](http://stackoverflow.com/questions/1902115/regular-expression-in-php-how-to-create-a-pattern-for-tables-in-html) – outis Feb 19 '11 at 09:25
  • ... ["find-and-replace-in-html regular expression fails"](http://stackoverflow.com/questions/3139191/find-and-replace-in-html-regular-expression-fails), ["Php regular expression to match a div"](http://stackoverflow.com/questions/2947360/php-regular-expression-to-match-a-div) – outis Feb 19 '11 at 09:33

1 Answer

As the comments on your question suggest, questions along these lines are asked frequently on Stack Overflow, and the right answer is generally "Don't try to parse HTML with regular expressions". As well as making that point, however, I think it's useful for an answer to include an example showing how one might take the suggested approach of using a DOM parser instead. For the case in your question, one could do:

<?php

$html = <<<EOF
    <P><B>1- TITLE</B>
    <P>
    <DL><DD>&nbsp;&nbsp;&nbsp; Text text text text text
text text
    </DL><P>
    <P><B>2 - Title 2</B>
    <P>
    <DL><DD>&nbsp;&nbsp;&nbsp; Text text text text text
text text Text text text text text
text text Text text text text text
text text
    <br><I>Additional irrelevant information</I>
    </DL><P>
EOF;

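// Parse the markup into a DOM tree; the HTML parser tolerates unclosed tags.
$d = new DOMDocument;
$d->loadHTML($html);

$xp = new DOMXPath($d);

// Each title is a <b> element that is a direct child of a <p>.
$matches = $xp->query("//p/b", $d);
foreach ($matches as $dn) {
    echo "Title is: " . $dn->nodeValue . "\n";
    // From the title's <p>, step over the next two siblings and descend
    // to the first child, then to that node's first child (the text).
    $dl = $dn->parentNode->nextSibling->nextSibling->firstChild;
    $dd = $dl->firstChild;
    echo "Content is: " . $dd->nodeValue . "\n";
}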
?>

Depending on how robust you need this to be, you would probably want to check that the nextSiblings and children are tags with the name you expect, but this shows the idea anyway.
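As a rough sketch of what such checking might look like, the variant below avoids counting siblings and instead asks XPath for the first `<dd>` that follows each title. The function name `extractTitleTextPairs`, the array shape it returns, and the shortened `$sample` are all assumptions made for illustration, not part of the code above. It also trims the non-breaking spaces and stops collecting text at the first child element, so the trailing `<I>` block is ignored.

```php
<?php
// Hypothetical helper for illustration; name and return shape are assumptions.
function extractTitleTextPairs(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml parse warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Collapse whitespace runs (including U+00A0 from &nbsp;) and trim the ends.
    $clean = function (string $s): string {
        return trim((string) preg_replace('/[\s\x{00A0}]+/u', ' ', $s));
    };

    $pairs = [];
    foreach ($xpath->query('//p/b') as $b) {
        // First <dd> after this title in document order, however the parser nested it.
        $dd = $xpath->query('following::dd[1]', $b)->item(0);
        if ($dd === null) {
            continue; // No content block follows this title.
        }
        // Keep only the text that precedes the first child element of the <dd>.
        $text = '';
        foreach ($dd->childNodes as $child) {
            if ($child->nodeType !== XML_TEXT_NODE) {
                break;
            }
            $text .= $child->nodeValue;
        }
        $pairs[] = ['title' => $clean($b->nodeValue), 'text' => $clean($text)];
    }
    return $pairs;
}

// Shortened sample in the same shape as the markup in the question.
$sample = <<<'HTML'
<P><B>1- TITLE</B>
<P>
<DL><DD>&nbsp;&nbsp;&nbsp; Text text text text text
text text
</DL><P>
<P><B>2 - Title 2</B>
<P>
<DL><DD>&nbsp;&nbsp;&nbsp; Text text text
text text
<br><I>Additional irrelevant information</I>
</DL><P>
HTML;

print_r(extractTitleTextPairs($sample));
```

One caveat with this approach: `following::dd[1]` simply takes the next `<dd>` in the document, so a title with no content block of its own would be paired with the following title's text. If that can happen in your data, you would need an extra check, for example comparing the `<dd>` against the next title's position.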

Mark Longair
  • Thanks Mark, I never thought of using a DOM model to parse it (I'm an ASP programmer). You saved me a lot of time trying to mess around with regular expressions! – MarathonStudios Feb 19 '11 at 13:02