1

I have some HTML source code and need to extract some text from it. I cannot use DOM because the document isn't well-formed.

The source may change later without my being notified, so the solution needs to hold up in most situations.

I'm fetching the source with cURL, and I plan to process it with the preg_match_all function and regular expressions.

Source :
...
<TR Class="Head1">
<TD width="15%"><font size="12">Name</font></TD>
<TD>:&nbsp;</TD>
<TD align="center"><font color="red">Alex</font></TD>
<TD width="25%"><b>Job</b></TD>
<TD>:&nbsp;</B></TD>
<TD align="center" width="25%"><font color="red">Doctor</font></TD>
</TR>
...
...
<TR Class="Head2">
<TD width="15%" align="left">Age</B></TD>
<TD>:&nbsp;</TD>
<TD align="center"><font color="red">32</font></TD>
<TD width="15%"><font size="10">data</TD></font>
<TD>&nbsp;</B></TD>
<TD width="40%">&nbsp;</TD>
</TR>
...

As you can see, the source is not well-formed. In fact, it's terrible! But there is nothing I can do about that. The real source is much longer than this.

How can I get the data out of the source? I could strip all the HTML tags, but then how would I know the order of the data? What can I do with preg_match_all and regex? What else could I do?

I look forward to your help.

Maozturk
  • Have you tried to use `DOM`? You can suppress errors using `@`, and it still works even if the document isn't well-formed. – Jake N Jan 26 '11 at 23:39

4 Answers

2

If you can use the DOM, it is far better than regexes. Take a look at PHP Tidy: it's designed to handle badly formed HTML.
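
A minimal sketch of that idea, assuming the tidy extension may or may not be installed: the hypothetical helper `loadMessyHtml()` (not from the original answer) runs the markup through `tidy_repair_string()` when Tidy is available and then hands the result to `DOMDocument`. The option values and the ASCII-only sample are assumptions for illustration.

```php
<?php
// Hypothetical helper: repair markup with Tidy when available, then parse it with DOM.
function loadMessyHtml(string $html): DOMDocument
{
    if (extension_loaded('tidy')) {
        // tidy_repair_string() returns the cleaned-up markup as a string (or false on failure).
        $repaired = tidy_repair_string($html, ['output-xhtml' => true, 'wrap' => 0], 'utf8');
        if (is_string($repaired) && $repaired !== '') {
            $html = $repaired;
        }
    }

    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc;
}

// Usage: stray </B> and mis-nested </font> tags, as in the question.
$doc = loadMessyHtml('<table><TR><TD>Age</B></TD><TD><font color="red">32</TD></font></TR></table>');
foreach ($doc->getElementsByTagName('td') as $td) {
    echo trim($td->textContent), "\n";
}
```

If the real pages contain non-ASCII text, the character encoding passed to Tidy and seen by `loadHTML()` would need attention; that is outside this sketch.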

Richard H
  • +1 - I added PHP Tidy to my answer when I remembered that TagSoup is in Java (and this question is in PHP) but you had it in your answer first. – Richard JP Le Guen Jan 26 '11 at 23:44
1

You can use DOMDocument to load badly formed HTML:

$doc = new DOMDocument();
// "@" hides the warnings libxml raises for the malformed markup.
@$doc->loadHTML('<TR Class="Head2">
<TD width="15%" align="left">Age</B></TD>
<TD>:&nbsp;</TD>
<TD align="center"><font color="red">32</font></TD>
<TD width="15%"><font size="10">data</TD></font>
<TD>&nbsp;</B></TD>
<TD width="40%">&nbsp;</TD>
</TR>');

// Print the text of every cell, one per line.
$tds = $doc->getElementsByTagName('td');
foreach ($tds as $td) {
    echo $td->textContent, "\n";
}

I'm suppressing warnings in the above code for brevity.

Output:

Age
: 
32
data
  <!-- space -->
  <!-- space -->

Using regex to parse HTML can be a futile effort as HTML is not a regular language.
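
To keep track of which value belongs to which label, one option is to walk the rows and read the cells in order. The sketch below is only an illustration: `extractPairs()` is a hypothetical helper, the `<table>` wrapper is an assumption about the elided parts of the source, and it assumes every row repeats the pattern label cell, separator cell, value cell, as the two sample rows do.

```php
<?php
// Hypothetical helper: map each label cell to the value cell two positions later.
function extractPairs(string $html): array
{
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $pairs = [];
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->getElementsByTagName('td') as $td) {
            // Strip whitespace, non-breaking spaces and ":" from both ends.
            $cells[] = (string) preg_replace('/^[\s:\x{00A0}]+|[\s:\x{00A0}]+$/u', '', $td->textContent);
        }
        // Assumed layout: [label, separator, value] repeated within each row.
        foreach (array_chunk($cells, 3) as $chunk) {
            if (count($chunk) === 3 && $chunk[0] !== '') {
                $pairs[$chunk[0]] = $chunk[2];
            }
        }
    }
    return $pairs;
}

$sample = <<<'HTML'
<table>
<TR Class="Head1">
<TD width="15%"><font size="12">Name</font></TD>
<TD>:&nbsp;</TD>
<TD align="center"><font color="red">Alex</font></TD>
<TD width="25%"><b>Job</b></TD>
<TD>:&nbsp;</B></TD>
<TD align="center" width="25%"><font color="red">Doctor</font></TD>
</TR>
<TR Class="Head2">
<TD width="15%" align="left">Age</B></TD>
<TD>:&nbsp;</TD>
<TD align="center"><font color="red">32</font></TD>
<TD width="15%"><font size="10">data</TD></font>
<TD>&nbsp;</B></TD>
<TD width="40%">&nbsp;</TD>
</TR>
</table>
HTML;

print_r(extractPairs($sample));
```

Keying on cell position rather than on attributes such as `width` or `color` is a design choice here: it should survive cosmetic changes to the page, but it would need adjusting if the column layout itself changes.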

webbiedave
  • As you said, I think regex is not useful for this. A non-well-formed HTML document can be processed with Tidy and DOM, or with SimpleHTMLDom alone. – Maozturk Feb 09 '11 at 13:24
0

Don't use RegEx. The link is funny but not informative, so the long and short of it is that HTML markup is not a regular language, hence cannot be parsed simply using regular expressions.

You could use RegEx to parse individual 'tokens' ( a single open tag; a single attribute name or value...) as part of a recursive parsing algorithm, but you cannot use a magic RegEx to parse HTML all on its own.
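
As a rough sketch of that token-based approach (not a full HTML parser): a regex splits the input into tag tokens and text tokens, and a small loop tracks which row and cell the text belongs to. The helper name `tokenizeCells()` is hypothetical, and the sketch assumes only `tr` and `td` tags matter and that no text contains a bare `<`.

```php
<?php
// Hypothetical helper: regex is used only to split tokens; a loop tracks rows and cells.
function tokenizeCells(string $html): array
{
    // Split into alternating tag tokens ("<...>") and text tokens.
    $tokens = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $rows = [];
    $inCell = false;
    foreach ($tokens as $token) {
        if ($token[0] !== '<') {
            if ($inCell) {
                // Append decoded text to the most recently opened cell.
                $r = count($rows) - 1;
                $c = count($rows[$r]) - 1;
                $rows[$r][$c] .= html_entity_decode($token, ENT_QUOTES | ENT_HTML5, 'UTF-8');
            }
            continue;
        }
        // Only <tr>, </tr>, <td> and </td> change state; every other tag is ignored.
        if (preg_match('~^<\s*(/?)\s*(tr|td)\b~i', $token, $m)) {
            $closing = ($m[1] === '/');
            if (strtolower($m[2]) === 'tr') {
                if (!$closing) {
                    $rows[] = [];
                }
                $inCell = false;
            } elseif ($closing) {
                $inCell = false;
            } else {
                if (!$rows) {
                    $rows[] = []; // tolerate a <td> that appears before any <tr>
                }
                $rows[count($rows) - 1][] = '';
                $inCell = true;
            }
        }
    }

    // Strip whitespace, non-breaking spaces and ":" from both ends of each cell.
    $clean = function (string $text): string {
        return (string) preg_replace('/^[\s:\x{00A0}]+|[\s:\x{00A0}]+$/u', '', $text);
    };
    return array_map(function (array $row) use ($clean) {
        return array_map($clean, $row);
    }, $rows);
}

$sample = <<<'HTML'
<TR Class="Head2">
<TD width="15%" align="left">Age</B></TD>
<TD>:&nbsp;</TD>
<TD align="center"><font color="red">32</font></TD>
<TD width="15%"><font size="10">data</TD></font>
<TD>&nbsp;</B></TD>
<TD width="40%">&nbsp;</TD>
</TR>
HTML;

print_r(tokenizeCells($sample));
```

Because a new `<td>` always starts a new cell, the loop does not depend on closing tags being present or correctly nested, which is where this markup is broken.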

Or you could use a parser.

Since the markup isn't valid, maybe you could use TagSoup or PHP:Tidy.

Richard JP Le Guen
  • Alright, are TagSoup and Tidy installed on the server by default? – Maozturk Jan 27 '11 at 00:06
  • I don't believe so; as a matter of fact, TagSoup is a Java tool (my bad!), although Tidy is apparently [bundled with PHP](http://www.php.net/manual/en/tidy.installation.php) – Richard JP Le Guen Jan 27 '11 at 15:15
  • A non-well-formed HTML document can be converted to well-formed HTML with Tidy, and then DOMDocument can be used. Thanks to all. – Maozturk Feb 09 '11 at 13:20
0
// The "~" delimiters are required; without them PCRE would treat the leading "<...>" as the delimiter pair.
$regex = <<<'EOF'
~<TR Class="Head2">\s+<TD width="15%" align="left">Age</B></TD>\s+<TD>:&nbsp;</TD>\s+<TD align="center"><font color="red">(\d+)</font></TD>\s+<TD width="15%"><font size="10">(\w+)</TD></font>\s+<TD>&nbsp;</B></TD>\s+<TD width="40%">&nbsp;</TD>\s+</TR>~
EOF;

preg_match_all($regex, $text, $result);

var_dump($result);
Ming-Tang