I have to write a parser for a web page. The structure is as follows:

 <TABLE WIDTH=80%>

<tr><td colspan=7><BR><BR></td></tr>
<TR>
<Td colspan=7><FONT FACE="arial" align=left><B><A NAME="TEST">Anagrafica</B><br></TH>
</TR>
<tr><td colspan=7></td></tr>
<TR>
 <TH ALIGN=LEFT ><FONT COLOR="#AA0000" FACE="arial" SIZE="2">Name</FONT></TH>
  <TH></TH>
  <TH ALIGN=LEFT ><FONT COLOR="#AA0000" FACE="arial" SIZE="2">Surname</FONT></TH>
  <TH></TH>
  <TH ALIGN=LEFT ><FONT COLOR="#AA0000" FACE="arial" SIZE="2">ID</FONT></TH>
  <TH></TH>
 <TH ALIGN=LEFT ><FONT COLOR="#AA0000" FACE="arial" SIZE="2">Code</FONT></TH>
 </TR>

 <tr>
 <TD COLSPAN="7">
 <HR SIZE="1" NOSHADE></TD>
 <TR>

 <TR>
   <TD ALIGN="left" VALIGN="TOP" NOWRAP><FONT SIZE="1" FACE="arial">Mario</FONT>     </TD>
   <TD WIDTH="10"><VALIGN="TOP"><FONT SIZE="1" FACE="arial">&#160;</FONT></TD>
   <TD ALIGN="CENTER" VALIGN="TOP" NOWRAP><P ALIGN="CENTER"><FONT SIZE="1" FACE="arial"> Mario </FONT></TD>
   <TD WIDTH="10"><VALIGN="TOP"><FONT SIZE="1" FACE="arial">&#160;</FONT></TD>
   <TD ALIGN="LEFT" VALIGN="TOP" NOWRAP><FONT SIZE="1" FACE="arial">1</FONT></TD>
   <TD WIDTH="10"><VALIGN="TOP"><FONT SIZE="1" FACE="arial">a</FONT></TD>
   <TD ALIGN="LEFT" VALIGN="TOP" NOWRAP><FONT SIZE="1" FACE="arial">132</FONT></TD>

 <TR>
   <TD ALIGN="left" VALIGN="TOP" NOWRAP><FONT SIZE="1" FACE="arial">Mario</FONT>     </TD>
   <TD WIDTH="10"><VALIGN="TOP"><FONT SIZE="1" FACE="arial">&#160;</FONT></TD>
   <TD ALIGN="CENTER" VALIGN="TOP" NOWRAP><P ALIGN="CENTER"><FONT SIZE="1" FACE="arial"> Mario </FONT></TD>
   <TD WIDTH="10"><VALIGN="TOP"><FONT SIZE="1" FACE="arial">&#160;</FONT></TD>
   <TD ALIGN="LEFT" VALIGN="TOP" NOWRAP><FONT SIZE="1" FACE="arial">1</FONT></TD>
   <TD WIDTH="10"><VALIGN="TOP"><FONT SIZE="1" FACE="arial">a</FONT></TD>
   <TD ALIGN="LEFT" VALIGN="TOP" NOWRAP><FONT SIZE="1" FACE="arial">132</FONT></TD>

 <TR> 

I want to extract the data from the 4 columns using this script:

$start = strpos($content,'<Td colspan=7><FONT FACE="arial" align=left><B><A NAME=');
if ($start == TRUE) {
    $end = strpos($content,'</TABLE>',$start) + 8;
    $table = substr($content,$start,$end-$start);
    preg_match_all("|<TD(.*)</TD>|U",$table,$rows);

    $x = 1;
    $counter = 1;
    echo "<table class=\"TFtable\">";
    foreach ($rows[0] as $row){
        if ((strpos($row,'<TR')===false)){
            preg_match_all("|<TD(.*)</TD>|U",$row,$cells);
            $status[$x] = strip_tags($cells[0][0]);
            $x = $x+1;
            $counter = $counter+1;
        }
        if ($counter % 7 == 1) {
            echo "<tr><td>{$status[2]} - {$status[4]} <br> {$status[6]} - {$status[1]}</td></tr>\n";
            $x = 1;
        }
    } 
    echo "</table>";
}

With this approach, however, the last field, $status[1], appears in the second row, as if it were part of row 2:

Expected output:

Mario Rossi 1 213

Mario Bianchi 2 324

Actual output:

Mario Rossi 1

Mario Bianchi 2 213

Where am I going wrong?

Vincenzo

2 Answers

Try DOMDocument instead of running regular expressions over the HTML. With loadHTML() you can let PHP parse the HTML for you. See the question "HTML DOM Document parsing" for an example.
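As a rough illustration of that suggestion, here is a minimal sketch that reads the table rows with DOMDocument. The function name `extractRows`, the shortened sample markup, and the rule "a data row has exactly 7 `<td>` cells, with the values in cells 0, 2, 4 and 6" are assumptions based on the layout shown in the question, not something the real page is known to guarantee.

```php
<?php
// Hypothetical sample that mirrors the question's layout: unclosed <TR> tags,
// 7 cells per data row, spacer cells between the 4 value cells.
$html = <<<'HTML'
<TABLE WIDTH=80%>
<TR>
  <TH>Name</TH><TH></TH><TH>Surname</TH><TH></TH><TH>ID</TH><TH></TH><TH>Code</TH>
</TR>
<TR>
  <TD NOWRAP><FONT SIZE="1">Mario</FONT></TD>
  <TD WIDTH="10"><FONT SIZE="1">&#160;</FONT></TD>
  <TD NOWRAP><FONT SIZE="1"> Rossi </FONT></TD>
  <TD WIDTH="10"><FONT SIZE="1">&#160;</FONT></TD>
  <TD NOWRAP><FONT SIZE="1">1</FONT></TD>
  <TD WIDTH="10"><FONT SIZE="1">a</FONT></TD>
  <TD NOWRAP><FONT SIZE="1">213</FONT></TD>
<TR>
  <TD NOWRAP><FONT SIZE="1">Mario</FONT></TD>
  <TD WIDTH="10"><FONT SIZE="1">&#160;</FONT></TD>
  <TD NOWRAP><FONT SIZE="1"> Bianchi </FONT></TD>
  <TD WIDTH="10"><FONT SIZE="1">&#160;</FONT></TD>
  <TD NOWRAP><FONT SIZE="1">2</FONT></TD>
  <TD WIDTH="10"><FONT SIZE="1">a</FONT></TD>
  <TD NOWRAP><FONT SIZE="1">324</FONT></TD>
<TR>
</TABLE>
HTML;

// Returns one [name, surname, id, code] array per data row.
function extractRows(string $html): array
{
    $doc = new DOMDocument();
    // Legacy markup triggers parser warnings; collect them instead of printing.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        // Only direct <td> children count, so header and separator rows drop out.
        $cells = [];
        foreach ($tr->childNodes as $child) {
            if ($child instanceof DOMElement && $child->tagName === 'td') {
                $cells[] = trim($child->textContent);
            }
        }
        // Assumption: a data row has exactly 7 cells (4 values, 3 spacers).
        if (count($cells) === 7) {
            $rows[] = [$cells[0], $cells[2], $cells[4], $cells[6]];
        }
    }
    return $rows;
}

foreach (extractRows($html) as [$name, $surname, $id, $code]) {
    echo "$name $surname $id $code\n";
}
```

Because each row is read as a unit, there is no running cell counter that can drift out of step with the row boundaries.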

RichardBernards

If you're actually trying to build a parser, you probably don't want to use a pre-built HTML/DOM parser. If this is the case, you'll probably want to follow these steps:

  • tokenize your input (you can use regex for this part)
  • process your tokens, determining each one's meaning as they come.
    • you might want to look into so-called "recursive descent parsers"
    • each token may change the meaning of the following token
    • you may need to look at the next token in line without processing it
  • return your output, most likely in the form of an object that represents a DOM tree
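The steps above can be sketched roughly as follows. This is a deliberately simplified illustration, not a conforming HTML parser: the function names `tokenize` and `parseNodes`, the token array shape, and the short list of void elements are all assumptions made for the example, and attributes, comments and scripts are ignored.

```php
<?php
// Step 1: tokenize. Split the input into start tags, end tags and text.
function tokenize(string $html): array
{
    preg_match_all(
        '~<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>|([^<]+)~',
        $html,
        $matches,
        PREG_SET_ORDER
    );
    $tokens = [];
    foreach ($matches as $m) {
        if (isset($m[4])) {                       // text between tags
            $text = trim($m[4]);
            if ($text !== '') {
                $tokens[] = ['type' => 'text', 'value' => $text];
            }
        } elseif ($m[1] === '/') {                // end tag such as </td>
            $tokens[] = ['type' => 'end', 'name' => strtolower($m[2])];
        } else {                                  // start tag such as <td ...>
            $tokens[] = ['type' => 'start', 'name' => strtolower($m[2])];
        }
    }
    return $tokens;
}

// Steps 2 and 3: recursive descent. Each start tag recurses to collect its
// children until its own end tag shows up; the result is a nested array tree.
function parseNodes(array $tokens, int &$pos, ?string $until = null): array
{
    $nodes = [];
    while ($pos < count($tokens)) {
        $token = $tokens[$pos];                   // peek without consuming
        if ($token['type'] === 'end') {
            if ($token['name'] === $until) {
                $pos++;                           // our own end tag: consume it
                return $nodes;
            }
            if ($until !== null) {
                return $nodes;                    // another element's end tag: leave it for the caller
            }
            $pos++;                               // stray end tag at top level: skip it
            continue;
        }
        $pos++;
        if ($token['type'] === 'text') {
            $nodes[] = $token['value'];
        } else {
            // Assumed void elements never have children or an end tag.
            $isVoid = in_array($token['name'], ['br', 'hr'], true);
            $children = $isVoid ? [] : parseNodes($tokens, $pos, $token['name']);
            $nodes[] = ['tag' => $token['name'], 'children' => $children];
        }
    }
    return $nodes;
}

$tokens = tokenize('<tr><td>Mario</td><td>Rossi</td></tr>');
$pos = 0;
$tree = parseNodes($tokens, $pos);
print_r($tree);
```

The end-tag check is where the "look at the next token without processing it" point applies: the token is inspected first and only consumed if it belongs to the element currently being parsed.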

You'll probably need to look into the formal definition of the language to determine what kinds of expressions may follow each other. For instance, the definition of a start tag likely looks something like the following (though this isn't a formal definition, and may contain errors):

'<' + tagName + attributes list + '>'

Again, this is probably wildly inaccurate, and you'll want to look into the formal definition of the language.
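To make the informal start-tag rule a little more concrete, here is one way it might be expressed with regular expressions. The function name `parseStartTag` and both patterns are assumptions for illustration only; they do not follow the formal grammar and, for example, break on attribute values that contain `>`.

```php
<?php
// Rough reading of: '<' + tagName + attributes list + '>'
// Returns ['name' => ..., 'attributes' => [...]] or null if $tag is not a start tag.
function parseStartTag(string $tag): ?array
{
    if (!preg_match('~^<([a-zA-Z][a-zA-Z0-9]*)([^>]*)>$~', $tag, $m)) {
        return null;
    }

    // One attribute: a name, optionally followed by = and a
    // double-quoted, single-quoted or unquoted value.
    $attributePattern =
        '~([a-zA-Z_:][-a-zA-Z0-9_:.]*)(?:\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+)))?~';
    preg_match_all(
        $attributePattern,
        $m[2],
        $found,
        PREG_SET_ORDER | PREG_UNMATCHED_AS_NULL
    );

    $attributes = [];
    foreach ($found as $a) {
        // Whichever value form matched wins; a bare name (e.g. NOWRAP) becomes true.
        $attributes[strtolower($a[1])] = $a[2] ?? $a[3] ?? $a[4] ?? true;
    }

    return ['name' => strtolower($m[1]), 'attributes' => $attributes];
}

var_dump(parseStartTag('<TD ALIGN="left" VALIGN="TOP" NOWRAP>'));
var_dump(parseStartTag('<TABLE WIDTH=80%>'));
var_dump(parseStartTag('</TD>'));
```

A function like this could supply the attribute details for the start-tag tokens produced during tokenization.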

The W3C document on the global structure of an HTML document might be a good place to start.

Ryan Kinal