I can parse the HTML page, but my code extracts only the text content, whereas I want the full HTML inside each <tr> and <td>. Below is my PHP code:

<?php
$dom = new DOMDocument();

// load the HTML
$html = $dom->loadHTMLFile("hydrocarbon.htm");

// discard white space
//$dom->preserveWhiteSpace = false;

// the table by its tag name
$tables = $dom->getElementsByTagName('table');

// get all rows from the table
$rows = $tables->item(0)->getElementsByTagName('tr');
// get each column by tag name
$cols = $rows->item(0)->getElementsByTagName('th');
$row_headers = NULL;
foreach ($cols as $node) {
    //print $node->nodeValue."\n";
    $row_headers[] = $node->nodeValue;
}

$table = array();
// get all rows from the table
$rows = $tables->item(0)->getElementsByTagName('tr');
foreach ($rows as $row)
{
    // get each column by tag name
    $cols = $row->getElementsByTagName('td');
    $row = array();
    $i = 0;
    foreach ($cols as $node) {
        //print $node->nodeValue."\n";
        if ($row_headers == NULL)
            $row[] = $node->nodeValue;
        else
            $row[$row_headers[$i]] = $node->nodeValue;
        $i++;
    }
    $table[] = $row;
}

//var_dump($table);
print("<pre>".print_r($table,true)."</pre>");
?>

The result contains only the text of each cell, with the markup stripped.

This is my HTML code:

<table>
<thead>
<tr><th>Column 1</th><th>Column 2</th><th>Column 3</th></tr>
</thead>
<tbody>
<tr> <td><b>Q</b></td><td>Desc.</td> </tr>
<tr> <td>Type</td><td>Multiple choice</td> </tr>
<tr><td>Option</td><td>image #####2</td><td>incorrect</td></tr>
<tr><td>Option</td><td>image #####2</td><td>incorrect</td></tr>
<tr><td>Option</td><td>image #####2</td><td>incorrect</td></tr>
<tr><td>Option</td><td>image #####2</td><td>incorrect</td></tr>

<tr><td>Solution</td><td>Some text / image</td></tr>
<tr><td>Marks</td><td>4</td><td>1</td></tr>
</tbody>
</table>

It returns Q and not <b>Q</b>. How can I get the cell's HTML, including the tags?

Edit 1: This is the original table that the solution needs to work on:

<table class=MsoNormalTable border=1 cellspacing=0 cellpadding=0 width=610 style='width:457.25pt;margin-left:10.8pt;background:#CED7E7;border-collapse:
 collapse;border:none'>
    <tr style='height:30.35pt'>
        <td width=112 valign=top style='width:84.0pt;border:solid black 1.0pt;
  background:transparent;padding:4.0pt 4.0pt 4.0pt 4.0pt;height:30.35pt'>
            <p class=BodyA><span lang=EN-US style='font-size:16.0pt;border:none'><span
  style='border:none'>Question<span style='border:none'> </span></span>
                </span>
            </p>
        </td>
        <td width=498 colspan=2 valign=top style='width:373.25pt;border:solid black 1.0pt;
  border-left:none;background:transparent;padding:4.0pt 4.0pt 4.0pt 4.0pt;
  height:30.35pt'>
            <p class=MsoNormal style='margin-top:0cm;margin-right:-48.45pt;margin-bottom:
  0cm;margin-left:18.0pt;margin-bottom:.0001pt;line-height:115%;border:none'><b><span
  lang=EN-US style='font-family:"Garamond","serif";border:none'><span
  style='border:none'>Consider the following reaction,</span></span></b>
            </p>
            <p class=MsoNormal style='margin-top:0cm;margin-right:-48.45pt;margin-bottom:
  0cm;margin-left:18.0pt;margin-bottom:.0001pt;line-height:115%'><b><span
  lang=EN-US style='font-family:"Garamond","serif";border:none'><span
  style='border:none'>H</span></span></b><b><sub><span lang=EN-US
  style='font-family:"Garamond","serif";border:none'><span style='border:none'>3</span></span></sub></b><b><span
  lang=EN-US style='font-family:"Garamond","serif";border:none'><span
  style='border:none'>C – CH – CH – CH</span></span></b><b><sub><span
  lang=EN-US style='font-family:"Garamond","serif";border:none'><span
  style='border:none'>3</span></span></sub></b><b><span lang=EN-US
  style='font-family:"Garamond","serif";border:none'><span style='border:none'>
  + </span></span></b><b><span lang=EN-US style='font-family:"Garamond","serif";
  position:relative;top:2.0pt;border:none'><img width=26 height=29
  src="hydrocarbon2_files/image001.png"></span></b><b><span lang=EN-US
  style='font-family:"Garamond","serif";border:none'><span style='border:none'> &#8594;
  ‘X’  + HBr                                                   </span></span></b>
            </p>
            <p class=MsoNormal style='margin-top:0cm;margin-right:-48.45pt;margin-bottom:
  0cm;margin-left:18.0pt;margin-bottom:.0001pt;line-height:115%'><b><span
  lang=EN-US style='font-family:"Garamond","serif";border:none'><span
  style='border:none'>            |        |</span></span></b>
            </p>
            <p class=MsoNormal style='margin-top:0cm;margin-right:-48.45pt;margin-bottom:
  0cm;margin-left:18.0pt;margin-bottom:.0001pt;line-height:115%'><b><span
  lang=EN-US style='font-family:"Garamond","serif";border:none'><span
  style='border:none'>            D       CH</span></span></b><b><sub><span
  lang=EN-US style='font-family:"Garamond","serif";border:none'><span
  style='border:none'>3</span></span></sub></b>
            </p>
            <p class=MsoNoSpacing style='margin-top:0cm;margin-right:-48.45pt;margin-bottom:
  0cm;margin-left:.3pt;margin-bottom:.0001pt;text-align:justify;text-indent:
  -.3pt'><b><span lang=EN-GB style='font-size:16.0pt;font-family:"Chaparral Pro","serif"'>&nbsp;</span></b>
            </p>
        </td>
    </tr>
    <tr style='height:15.0pt'>
        <td width=112 valign=top style='width:84.0pt;border:solid black 1.0pt;
  border-top:none;background:transparent;padding:4.0pt 4.0pt 4.0pt 4.0pt;
  height:15.0pt'>
            <p class=BodyA><span lang=EN-US style='font-size:16.0pt;border:none'><span
  style='border:none'>Type</span></span>
            </p>
        </td>
        <td width=498 colspan=2 valign=top style='width:373.25pt;border-top:none;
  border-left:none;border-bottom:solid black 1.0pt;border-right:solid black 1.0pt;
  background:transparent;padding:4.0pt 4.0pt 4.0pt 4.0pt;height:15.0pt'>
            <p class=BodyA><span lang=EN-US style='font-size:16.0pt;border:none'><span
  style='border:none'>multiple_choice</span></span>
            </p>
        </td>
    </tr>
    <tr style='height:15.0pt'>
        <td width=112 valign=top style='width:84.0pt;border:solid black 1.0pt;
  border-top:none;background:transparent;padding:4.0pt 4.0pt 4.0pt 4.0pt;
  height:15.0pt'>
            <p class=BodyA><span lang=EN-US style='font-size:16.0pt;border:none'><span
  style='border:none'>Option</span></span>
            </p>
        </td>
        <td width=219 valign=top style='width:164.25pt;border-top:none;border-left:
  none;border-bottom:solid black 1.0pt;border-right:solid black 1.0pt;
  background:transparent;padding:4.0pt 4.0pt 4.0pt 4.0pt;height:15.0pt'>
            <p class=BodyA><span style='font-size:16.0pt;color:black;border:none'><img
  width=205 height=93 src="hydrocarbon2_files/image002.jpg"></span>
            </p>
        </td>
        <td width=279 valign=top style='width:209.0pt;border-top:none;border-left:
  none;border-bottom:solid black 1.0pt;border-right:solid black 1.0pt;
  background:transparent;padding:4.0pt 4.0pt 4.0pt 4.0pt;height:15.0pt'>
            <p class=BodyA><span lang=EN-US style='font-size:16.0pt;border:none'><span
  style='border:none'>I</span></span><span lang=EN-US style='font-size:16.0pt;
  border:none'><span style='border:none'>n<span style='border:none'>correct</span></span>
                </span>
            </p>
        </td>
    </tr>
    <tr style='height:15.0pt'>
        <td width=112 valign=top style='width:84.0pt;border:solid black 1.0pt;
  border-top:none;background:transparent;padding:4.0pt 4.0pt 4.0pt 4.0pt;
  height:15.0pt'>
            <p class=BodyA><span lang=EN-US style='font-size:16.0pt;border:none'><span
  style='border:none'>Option</span></span>
            </p>
        </td>
        <td width=219 valign=top style='width:164.25pt;border-top:none;border-left:
  none;border-bottom:solid black 1.0pt;border-right:solid black 1.0pt;
  background:transparent;padding:4.0pt 4.0pt 4.0pt 4.0pt;height:15.0pt'>
            <p class=BodyA><span style='font-size:16.0pt;border:none'><img width=205
  height=102 id="Picture 13" src="hydrocarbon2_files/image003.jpg"></span>
            </p>
        </td>
        <td width=279 valign=top style='width:209.0pt;border-top:none;border-left:
  none;border-bottom:solid black 1.0pt;border-right:solid black 1.0pt;
  background:transparent;padding:4.0pt 4.0pt 4.0pt 4.0pt;height:15.0pt'>
            <p class=BodyA><span lang=EN-US style='font-size:16.0pt;border:none'><span
  style='border:none'>C</span></span><span lang=EN-US style='font-size:16.0pt;
  border:none'><span style='border:none'>orrect</span></span>
            </p>
        </td>
    </tr>
    <tr style='height:15.0pt'>
        <td width=112 valign=top style='width:84.0pt;border:solid black 1.0pt;
  border-top:none;background:transparent;padding:4.0pt 4.0pt 4.0pt 4.0pt;
  height:15.0pt'>
            <p class=BodyA><span lang=EN-US style='font-size:16.0pt;border:none'><span
  style='border:none'>Option</span></span>
            </p>
        </td>
        <td width=219 valign=top style='width:164.25pt;border-top:none;border-left:
  none;border-bottom:solid black 1.0pt;border-right:solid black 1.0pt;
  background:transparent;padding:4.0pt 4.0pt 4.0pt 4.0pt;height:15.0pt'>
            <p class=BodyA><span style='font-size:16.0pt;border:none'><img width=205
  height=107 id="Picture 16" src="hydrocarbon2_files/image004.jpg"></span>
            </p>
        </td>
        <td width=279 valign=top style='width:209.0pt;border-top:none;border-left:
  none;border-bottom:solid black 1.0pt;border-right:solid black 1.0pt;
  background:transparent;padding:4.0pt 4.0pt 4.0pt 4.0pt;height:15.0pt'>
            <p class=BodyA><span lang=EN-US style='font-size:16.0pt;border:none'><span
  style='border:none'>Incorrect</span></span>
            </p>
        </td>
    </tr>
    <tr style='height:15.0pt'>
        <td width=112 valign=top style='width:84.0pt;border:solid black 1.0pt;
  border-top:none;background:transparent;padding:4.0pt 4.0pt 4.0pt 4.0pt;
  height:15.0pt'>
            <p class=BodyA><span lang=EN-US style='font-size:16.0pt;border:none'><span
  style='border:none'>Option</span></span>
            </p>
        </td>
        <td width=219 valign=top style='width:164.25pt;border-top:none;border-left:
  none;border-bottom:solid black 1.0pt;border-right:solid black 1.0pt;
  background:transparent;padding:4.0pt 4.0pt 4.0pt 4.0pt;height:15.0pt'>
            <p class=BodyA><span style='font-size:16.0pt;border:none'><img width=205
  height=112 id="Picture 19" src="hydrocarbon2_files/image005.jpg"></span>
            </p>
        </td>
        <td width=279 valign=top style='width:209.0pt;border-top:none;border-left:
  none;border-bottom:solid black 1.0pt;border-right:solid black 1.0pt;
  background:transparent;padding:4.0pt 4.0pt 4.0pt 4.0pt;height:15.0pt'>
            <p class=BodyA><span lang=EN-US style='font-size:16.0pt;border:none'><span
  style='border:none'>Incorrect</span></span>
            </p>
        </td>
    </tr>
    <tr style='height:15.0pt'>
        <td width=112 valign=top style='width:84.0pt;border:solid black 1.0pt;
  border-top:none;background:transparent;padding:4.0pt 4.0pt 4.0pt 4.0pt;
  height:15.0pt'>
            <p class=BodyA><span lang=EN-US style='font-size:16.0pt;border:none'><span
  style='border:none'>Solution</span></span>
            </p>
        </td>
        <td width=498 colspan=2 valign=top style='width:373.25pt;border-top:none;
  border-left:none;border-bottom:solid black 1.0pt;border-right:solid black 1.0pt;
  background:transparent;padding:4.0pt 4.0pt 4.0pt 4.0pt;height:15.0pt'>
            <p class=MsoNormal style='margin-left:27.0pt;text-align:justify;text-indent:
  -27.0pt;line-height:115%'><span style='font-family:"Garamond","serif";
  border:none'><img width=398 height=92 id="Picture 10"
  src="hydrocarbon2_files/image006.jpg"></span>
            </p>
        </td>
    </tr>
    <tr style='height:15.0pt'>
        <td width=112 valign=top style='width:84.0pt;border:solid black 1.0pt;
  border-top:none;background:transparent;padding:4.0pt 4.0pt 4.0pt 4.0pt;
  height:15.0pt'>
            <p class=BodyA><span lang=EN-US style='font-size:16.0pt;border:none'><span
  style='border:none'>Marks</span></span>
            </p>
        </td>
        <td width=219 valign=top style='width:164.25pt;border-top:none;border-left:
  none;border-bottom:solid black 1.0pt;border-right:solid black 1.0pt;
  background:transparent;padding:4.0pt 4.0pt 4.0pt 4.0pt;height:15.0pt'>
            <p class=BodyA><span lang=EN-US style='font-size:16.0pt;border:none'><span
  style='border:none'>4</span></span>
            </p>
        </td>
        <td width=279 valign=top style='width:209.0pt;border-top:none;border-left:
  none;border-bottom:solid black 1.0pt;border-right:solid black 1.0pt;
  background:transparent;padding:4.0pt 4.0pt 4.0pt 4.0pt;height:15.0pt'>
            <p class=BodyA><span lang=EN-US style='font-size:16.0pt;border:none'><span
  style='border:none'>1</span></span>
            </p>
        </td>
    </tr>
</table>
user2828442
  • This answer will help you: https://stackoverflow.com/a/17613826/12232340 – Jan 19 '20 at 11:10
  • That does not address my requirement. The answer below works with the sample HTML, but it does not work when I use my actual HTML (added at the end of the question). – user2828442 Jan 19 '20 at 13:00

1 Answer

In your second foreach loop, replace nodeValue with saveHTML() so that the markup is kept:

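foreach ($rows as $row)
{
    // get each column by tag name
    $cols = $row->getElementsByTagName('td');
    $row = array();
    $i = 0;
    foreach ($cols as $node) {
        // Serialize every child node, not only the first one, so that cells
        // starting with whitespace or holding several elements keep all their markup.
        $innerHtml = '';
        foreach ($node->childNodes as $child) {
            $innerHtml .= $node->ownerDocument->saveHTML($child);
        }

        // Key by header name when one exists for this column; otherwise use a numeric index.
        if (isset($row_headers[$i]))
            $row[$row_headers[$i]] = $innerHtml;
        else
            $row[] = $innerHtml;
        $i++;
    }
    $table[] = $row;
}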

The output for the sample table will then be:

[1] => Array
(
    [Column 1] => <b>Q</b>
    [Column 2] => Desc.
)
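
The original table in the question has no <th> cells, so $row_headers stays NULL and the nodeValue branch runs for every cell; its cells also begin with whitespace and contain several child elements. A minimal, self-contained sketch of one way to handle that case follows. The innerHtml() helper name and the inline sample markup are assumptions made for illustration, not part of the question's file:

```php
<?php
// Hypothetical helper (not from the original post): returns the markup inside
// a node by serializing each of its child nodes in document order.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return trim($html);
}

// Assumed sample input: no <th> row, and a cell whose first child is whitespace.
$source = '<table>'
    . '<tr><td> <p><span>Question</span></p> </td><td><b>H</b><sub>3</sub>C</td></tr>'
    . '<tr><td><p>Type</p></td><td><p>multiple_choice</p></td></tr>'
    . '</table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($source);
libxml_clear_errors();

$table = [];
foreach ($dom->getElementsByTagName('tr') as $tr) {
    $cells = [];
    foreach ($tr->getElementsByTagName('td') as $td) {
        $cells[] = innerHtml($td);   // full markup of the cell, not just its text
    }
    $table[] = $cells;
}

print_r($table);
```

Because the helper walks all child nodes, it does not depend on what the first child of a cell happens to be. For the real file, loadHTMLFile("hydrocarbon.htm") would take the place of loadHTML($source).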
Ronak Dhoot