Analyzing HTML Page

Question

I have a question about analyzing HTML pages. For example, there is a page, www.example.com/page.html, that contains information I need in tables, and www.example.com/page2.html has some other information, but as plain text. Currently I'm using a regex (preg_match_all) with a pattern I had to write by hand. Is there a faster or better way to do this? So the full question would be: is there a fast, reliable way to extract information from an HTML page that doesn't require me to take parts of the page source and edit them into a regex pattern by hand?

(Other information: I'm using PHP in combination with cURL to get the page's content, then I use preg_match_all to extract the data.)

score 4 · Accepted Answer · answered Apr 20 '11 at 18:42

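Yes! You can load the content of the webpage into a PHP DOMDocument and fetch the data by element ID or tag name, much as you would in JavaScript. Selecting by HTML class is also possible, but it goes through a DOMXPath query, since DOMDocument has no class-based lookup method of its own.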

Here is the documentation: http://www.php.net/manual/en/class.domdocument.php

You should start off by using

$dom = new DOMDocument();
$dom->loadHTML($html); // loadHTML() is an instance method; $html holds the fetched page source

Then follow the documentation and its examples.
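
For the table case from the question, a minimal sketch might look like the following. It is only an illustration under assumptions: the sample markup, the table id "prices", and the variable names are made up, and in practice $html would be the string returned by cURL. It uses DOMXPath on top of DOMDocument to pull the data cells out of each row.

```php
<?php
// Hypothetical sample markup standing in for the cURL response body.
$html = <<<HTML
<html><body>
<table id="prices">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Apple</td><td>1.20</td></tr>
  <tr><td>Pear</td><td>0.95</td></tr>
</table>
</body></html>
HTML;

// Collect parser warnings internally instead of printing them;
// real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$rows = [];
// Select every row of the (assumed) "prices" table that contains <td> cells,
// which skips the header row made of <th> cells.
foreach ($xpath->query('//table[@id="prices"]//tr[td]') as $tr) {
    $cells = [];
    // Relative query: only the <td> children of the current row.
    foreach ($xpath->query('td', $tr) as $td) {
        $cells[] = trim($td->textContent);
    }
    $rows[] = $cells;
}

print_r($rows);
```

Because the query addresses the structure of the page rather than its exact text, small changes in whitespace or attribute order should not require touching the expression, which is the main practical advantage over a hand-written regex pattern.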

score 2 · Answer 2 · edited May 23 '17 at 12:03

Use any of the parsers suggested in this post. You should never use regular expressions to parse HTML.

answered Apr 20 '11 at 18:41

Wes


score 1 · Answer 3 · answered Apr 20 '11 at 18:40

You can use the DOM, for example through PHP's DOMDocument class.
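
For a page whose data sits in plain text rather than tables, a rough sketch of the DOM approach could be the following. The sample markup and variable names are assumptions for illustration only; $html would normally be the page source fetched with cURL.

```php
<?php
// Hypothetical sample page; replace with the fetched page source.
$html = '<html><body><h1>News</h1>'
      . '<p>First paragraph.</p>'
      . '<p>Second <b>bold</b> paragraph.</p>'
      . '</body></html>';

// Keep parser warnings about imperfect markup out of the output.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$paragraphs = [];
// getElementsByTagName() returns every <p> element in document order;
// textContent flattens nested tags such as <b> into plain text.
foreach ($dom->getElementsByTagName('p') as $p) {
    $paragraphs[] = trim($p->textContent);
}

print_r($paragraphs);
```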

Finbarr
