How Can I Get Data From HTML Source Code with PHP and RegEx?

Question

I have HTML source code and need to extract some text from it. I cannot use the DOM, because the document isn't well-formed.

The source might change later without my knowing, so the solution should work in most situations.

I'm fetching the source with cURL, and I plan to process it with the preg_match_all function and regular expressions.

Source:
...
<TR Class="Head1">
<TD width="15%">Name</TD>
<TD>: </TD>
<TD align="center">Alex</TD>
<TD width="25%">Job</TD>
<TD>: </TD>
<TD align="center" width="25%">Doctor</TD>
</TR>
...
...
<TR Class="Head2">
<TD width="15%" align="left">Age</TD>
<TD>: </TD>
<TD align="center">32</TD>
<TD width="15%">data</TD>
<TD> </TD>
<TD width="40%"> </TD>
</TR>
...

As you can see, the source is not well-formed. In fact, it is terrible! But there is nothing I can do about that. The real source is longer than this.

How can I get the data out of the source? I could strip all of the HTML tags, but then how would I know the order of the data? What can I do with preg_match_all and regex? What else can I do?

I'm waiting for your help.

Have you tried to use `DOM`? You can suppress errors using `@`, and it still works even if the document isn't well-formed. – Jake N, Jan 26 '11 at 23:39

score 2 · Accepted Answer · answered Jan 26 '11 at 23:39

If you can use the DOM, this is far better than regexes. Take a look at PHP Tidy; it's designed to handle badly formed HTML.
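A minimal sketch of that combination, assuming the Tidy extension may or may not be available: the markup is repaired with `tidy_repair_string()` when the function exists and then loaded into `DOMDocument`. The sample input (wrapped in a `<table>`) and the variable names are illustrative assumptions, not part of the original answer.

```php
<?php
// Hypothetical input: part of the first row from the question, wrapped in a
// <table> so the row has a parent element.
$html = '<table><TR Class="Head1">
<TD width="15%">Name</TD>
<TD>: </TD>
<TD align="center">Alex</TD>
</TR></table>';

// Repair the markup with Tidy when the extension is loaded; otherwise keep
// the raw string and rely on DOMDocument's own error recovery.
if (function_exists('tidy_repair_string')) {
    $repaired = tidy_repair_string($html, array('output-xhtml' => true), 'utf8');
    if ($repaired !== false) {
        $html = $repaired;
    }
}

// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Gather the trimmed text of every cell, in document order.
$cells = array();
foreach ($doc->getElementsByTagName('td') as $td) {
    $cells[] = trim($td->textContent);
}

print_r($cells);
```

Whether Tidy is actually needed depends on how broken the input is; `DOMDocument::loadHTML()` already recovers from many errors on its own.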

Richard H

+1 - I added PHP Tidy to my answer when I remembered that TagSoup is in Java (and this question is in PHP) but you had it in your answer first. – Richard JP Le Guen Jan 26 '11 at 23:44

score 1 · Answer 2 · answered Jan 27 '11 at 00:18

You can use DOMDocument to load badly formed HTML:

$doc = new DOMDocument();
@$doc->loadHTML('<TR Class="Head2">
<TD width="15%" align="left">Age</B></TD>
<TD>:&nbsp;</TD>
<TD align="center"><font color="red">32</font></TD>
<TD width="15%"><font size="10">data</TD></font>
<TD>&nbsp;</B></TD>
<TD width="40%">&nbsp;</TD>
</TR>');


// getElementsByTagName() emits no warnings, so no @ is needed here.
$tds = $doc->getElementsByTagName('td');
foreach ($tds as $td) {
    echo $td->textContent, "\n";
}

I'm suppressing warnings in the above code for brevity.

Output:

Age
: 
32
data
  <!-- space -->
  <!-- space -->

Using regex to parse HTML can be a futile effort as HTML is not a regular language.
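To address the question of how to keep track of which value belongs to which label, here is a hedged sketch that builds on the DOM approach above. It assumes each row holds repeating label / separator / value triples, as in the question's sample, and that the rows sit inside a `<table>` (the surrounding markup is elided in the question). It also uses `libxml_use_internal_errors()` instead of `@` to keep parser warnings quiet.

```php
<?php
// Sample rows from the question, wrapped in a <table> (an assumption).
$html = '<table>
<TR Class="Head1">
<TD width="15%">Name</TD>
<TD>: </TD>
<TD align="center">Alex</TD>
<TD width="25%">Job</TD>
<TD>: </TD>
<TD align="center" width="25%">Doctor</TD>
</TR>
<TR Class="Head2">
<TD width="15%" align="left">Age</TD>
<TD>: </TD>
<TD align="center">32</TD>
<TD width="15%">data</TD>
<TD> </TD>
<TD width="40%"> </TD>
</TR>
</table>';

// Collect parser warnings internally rather than suppressing them with @.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$data = array();
foreach ($doc->getElementsByTagName('tr') as $tr) {
    // Trimmed text of each cell in this row, in order.
    $cells = array();
    foreach ($tr->getElementsByTagName('td') as $td) {
        $cells[] = trim($td->textContent);
    }
    // Assumed layout: label, separator, value, repeated along the row.
    foreach (array_chunk($cells, 3) as $triple) {
        // Skip incomplete triples and those with an empty label or value.
        if (count($triple) === 3 && $triple[0] !== '' && $triple[2] !== '') {
            $data[$triple[0]] = $triple[2];
        }
    }
}

print_r($data);
```

Because the pairing relies on cell position rather than on attributes such as `width` or `align`, it keeps working if those attributes change, but it would need adjusting if the number of cells per label changes.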

As you said, I think regex is not useful for this. A non-well-formed HTML document can be processed with Tidy and DOM, or with SimpleHTMLDom alone. – Maozturk, Feb 09 '11 at 13:24

score 0 · Answer 3 · edited May 23 '17 at 12:31

Don't use RegEx. The link is funny but not informative, so the long and short of it is that HTML markup is not a regular language, hence cannot be parsed simply using regular expressions.

You could use RegEx to parse individual 'tokens' (a single open tag, a single attribute name or value, ...) as part of a recursive parsing algorithm, but you cannot use a magic RegEx to parse HTML all on its own.
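As an illustration of what token-level use of a regular expression could look like, the sketch below pulls attribute names and values out of one opening tag that has already been isolated. The helper name `parseAttributes` and the pattern are assumptions made for this example; the pattern only understands quoted values.

```php
<?php
// Hypothetical helper: extract name => value pairs from ONE opening tag.
// Only single- or double-quoted values are recognised, not unquoted ones.
function parseAttributes($tag)
{
    $attributes = array();
    // Group 1 is the attribute name; the branch-reset group (?|...) puts the
    // unquoted value in group 2 for both quote styles.
    $pattern = '~([\w:-]+)\s*=\s*(?|"([^"]*)"|\'([^\']*)\')~';
    preg_match_all($pattern, $tag, $matches, PREG_SET_ORDER);
    foreach ($matches as $match) {
        // HTML attribute names are case-insensitive, so normalise them.
        $attributes[strtolower($match[1])] = $match[2];
    }
    return $attributes;
}

print_r(parseAttributes('<TD align="center" width="25%">'));
```

This works because a single tag has no nesting; the nesting of whole documents is what a lone regular expression cannot follow.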

Or you could use a parser.

Since the markup isn't valid, maybe you could use TagSoup or PHP:Tidy.

answered Jan 26 '11 at 23:38

Richard JP Le Guen

Alright, are TagSoup and Tidy installed on the server by default? – Maozturk Jan 27 '11 at 00:06
I don't think so; as a matter of fact, TagSoup is a Java tool (my bad!), although Tidy is apparently [bundled with PHP](http://www.php.net/manual/en/tidy.installation.php) – Richard JP Le Guen Jan 27 '11 at 15:15
A non-well-formed HTML document can be converted to well-formed HTML by Tidy, and then DOMDocument can be used. Thanks to all. – Maozturk Feb 09 '11 at 13:20

score 0 · Answer 4 · answered Jan 26 '11 at 23:42

// PCRE patterns need delimiters; "~" is used because it does not occur in the markup.
$regex = <<<EOF
~<TR Class="Head2">\s+<TD width="15%" align="left">Age</B></TD>\s+<TD>:&nbsp;</TD>\s+<TD align="center"><font color="red">(\d+)</font></TD>\s+<TD width="15%"><font size="10">(\w+)</TD></font>\s+<TD>&nbsp;</B></TD>\s+<TD width="40%">&nbsp;</TD>\s+</TR>~
EOF;

// $text holds the fetched HTML source.
preg_match_all($regex, $text, $result);

var_dump($result);
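The pattern above only matches while the markup stays exactly the same apart from whitespace. As a hedged sketch of a looser variant, the snippet below matches each cell regardless of its attributes. It assumes cells are never nested and that every `<TD>` has a closing `</TD>`; the sample input is hypothetical. A parser remains the more robust route.

```php
<?php
// Hypothetical sample input, modelled on the second row in the question.
$text = '<TR Class="Head2">
<TD width="15%" align="left">Age</TD>
<TD>: </TD>
<TD align="center"><font color="red">32</font></TD>
</TR>';

// Match each cell whatever its attributes are; "i" ignores tag case and
// "s" lets "." span line breaks inside a cell.
preg_match_all('~<td\b[^>]*>(.*?)</td>~is', $text, $matches);

// Drop any inline tags left inside the cell and trim the text.
$cells = array();
foreach ($matches[1] as $inner) {
    $cells[] = trim(strip_tags($inner));
}

print_r($cells);
```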
