How to parse HTML/XML with PHP

Question

From a gateway I get a very unusual result: HTML inside XML, which confuses me. When I echo the variable $result, this is the output:

<Results>
    <XML_Report>
       <Subject>
         <EFX_Code>199</EFX_Code>
         <Referral>SPECIAL_WOHA</Referral>
       </Subject>
    </XML_Report>
<HTML_Report>
<![CDATA[
        <html>
        <head>


        </head>
        <body>



        <a name="mergereport" />

        <p>MERGE REPORT</p>

        <table border="1" WIDTH="100%" cellpadding=0 cellspacing=0>
        <tr><td class=heading colspan=4 align="center" bgcolor="#c0c0c0"><p class=heading>Personal Information Since 08/09/09 FAD 04/17/12</p></td></tr>
        <tr><td><br /></td><td><br /></td><td width="15%" align=center><p><b>Reported</b></p></td><td align=center><p><b>Bur</b></p></td></tr>
        <tr>
        <td width="15%" valign=top align=right><p class=pipad><b>
        Name<br />
        SSN<br />
        Inquiry SSN<br />
        DOB<br />
        Address
        </b></p></td>
        </tr></table>
        </body>

        </html>
]]>
 </HTML_Report>
</Results>

How can I parse that variable with PHP to extract only the part of the HTML I want, e.g. anything within particular tags? I've searched a lot but can't find a proper answer on whether such parsing is possible and, more importantly, HOW?

no. this is the most common question on Stack Overflow. Don't do it this way, use a xml parser. — hackartist, Apr 20 '12 at 02:45
You have to read this : http://stackoverflow.com/a/1732454/14673 — Luc M, Apr 20 '12 at 03:25
This SO answer explains it all for you. http://stackoverflow.com/questions/6674322/how-to-get-values-inside-cdatavalues-using-php-dom — Brian P Johnson, Apr 20 '12 at 04:59
Canonical reference: [How do you parse and process HTML/XML in PHP?](http://stackoverflow.com/q/3577641/367456) — hakre, Apr 24 '14 at 13:37

score 2 · Answer 1 · answered Apr 20 '12 at 03:14

$doc = new DOMDocument();
$doc->loadHTML($your_html);

Then read up on how to use the DOM library.
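
To make that a bit more concrete, here is a minimal sketch of querying the loaded document with DOMXPath. The sample markup in $your_html and the XPath expression are assumptions for illustration, loosely based on the report in the question; adjust both to the elements you actually need.

```php
<?php
// Hypothetical sample input; in practice this is the HTML taken from the gateway response.
$your_html = '<html><body><p>MERGE REPORT</p>'
    . '<table><tr><td><p class="heading">Personal Information</p></td></tr></table>'
    . '</body></html>';

// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($your_html);
libxml_clear_errors();

// Query the tree with XPath: every <p> element nested inside a table cell.
$xpath = new DOMXPath($doc);
$cells = [];
foreach ($xpath->query('//td//p') as $node) {
    $cells[] = trim($node->textContent);
}

print_r($cells);
```

DOMXPath is only one way to walk the tree; getElementsByTagName() on the document works too when a tag name alone is enough.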

Anthony

score 0 · Answer 2 · answered Apr 20 '12 at 02:46

In an ideal world, the XML_Report would be for scripts like your PHP to read, and the HTML_Report would only be for human display. That doesn't, however, appear to be the case from the sample you posted.

You have two parsing tasks here.

First, parse the XML. Navigate within it (via XPath or DOM functions) to the CDATA contents of the HTML_Report element.

Now, the second task: parse the HTML, just as if you'd received it as a raw string.

If what you're asking is "how do I parse HTML using PHP?" there are around 1.874 billion answers on this very site.
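
The two tasks described above could be sketched roughly like this. The $result string is a trimmed-down stand-in for the gateway response, and the choice of SimpleXML for the first step and DOMDocument for the second is an assumption; any XML parser and any HTML parser would do.

```php
<?php
// Trimmed-down stand-in for the $result string from the question.
$result = <<<'SAMPLE'
<Results>
    <XML_Report>
        <Subject>
            <EFX_Code>199</EFX_Code>
        </Subject>
    </XML_Report>
    <HTML_Report><![CDATA[
        <html><body>
        <p>MERGE REPORT</p>
        <table><tr><td><p class=heading>Personal Information</p></td></tr></table>
        </body></html>
    ]]></HTML_Report>
</Results>
SAMPLE;

// Task 1: parse the outer XML. LIBXML_NOCDATA merges CDATA sections into
// ordinary text, so the string cast below yields the raw HTML.
$xml     = simplexml_load_string($result, 'SimpleXMLElement', LIBXML_NOCDATA);
$efxCode = (string) $xml->XML_Report->Subject->EFX_Code;
$html    = trim((string) $xml->HTML_Report);

// Task 2: parse the extracted HTML as if it had arrived as a raw string.
libxml_use_internal_errors(true); // tolerate sloppy markup such as class=heading
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Example lookup: the text of the first <p> element in the report.
$title = trim($doc->getElementsByTagName('p')->item(0)->textContent);

echo $efxCode, "\n", $title, "\n";
```

Keeping the two parses separate means the XML layer never has to understand the HTML, and the HTML parser never sees the XML envelope.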

score -1 · Answer 3 · answered Apr 20 '12 at 03:12

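// Brittle: assumes '<html>' and '</html>' each occur in $xml exactly as written here.
$start = strpos($xml, '<html>');
$end   = strpos($xml, '</html>');
$html  = substr($xml, $start, $end - $start + strlen('</html>'));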

Jack

TheOx · Accepted Answer · 2012-04-20T03:04:13.350

-2

A quick and dirty solution:

//Assumes the contents of the xml file are in a string called $xml
$arr = explode("<HTML_Report>", $xml);
if(count($arr) > 1)
{
    $arr2 = explode("</HTML_Report>", $arr[1]);
    $html_portion = $arr2[0];
}

Summary: split the XML string at the HTML_Report start and end tags, each time keeping only the element of the resulting array that contains the HTML portion. $html_portion will then still contain the CDATA wrapper, so if you want to avoid that, split on the CDATA delimiters as well.

It ain't elegant but it gets the job done.
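
In the same quick and dirty spirit, stripping the CDATA wrapper from $html_portion might look like the sketch below. It assumes the wrapper appears once and uses the standard XML CDATA markers; the sample value of $html_portion and the name $html_only are made up for illustration.

```php
<?php
// Hypothetical value of $html_portion, as the explode() calls above would leave it.
$html_portion = "\n<![CDATA[\n<html><body><p>MERGE REPORT</p></body></html>\n]]>\n ";

$open  = '<![CDATA[';
$close = ']]>';

// First opening marker and last closing marker.
$start = strpos($html_portion, $open);
$end   = strrpos($html_portion, $close);

if ($start !== false && $end !== false && $end > $start) {
    // Keep only what sits between the two markers.
    $start    += strlen($open);
    $html_only = trim(substr($html_portion, $start, $end - $start));
} else {
    // No CDATA wrapper found; fall back to the untouched string.
    $html_only = trim($html_portion);
}

echo $html_only, "\n";
```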

EDIT: Fixed code from $xml[1] to $arr[1] - thanks Marc B.

edited Apr 20 '12 at 03:04

answered Apr 20 '12 at 02:52


using `$xml[1]` would simply be the 2nd char of the entire xml document, since presumably $xml is just a php string... – Marc B Apr 20 '12 at 02:58
@MarcB you're right - typo, supposed to be $arr[1] not $xml[1] – TheOx Apr 20 '12 at 03:04
@TheOx Guessing, but it's probably because `` could occur within the body of another `` tag, so your code isn't actually correct... I personally recommend using a parser to parse structured languages, instead of hacking string manipulations. – Borealid Apr 20 '12 at 15:20
@Borealid - I see what you're saying, although I answered on the assumption that the XML format was pretty much set with what the user posted. But you're right - a parser is generally a more stable and flexible solution. – TheOx Apr 20 '12 at 17:17
