Read external HTML page and then find data within

Question

I'm playing around with an idea, and I'm stuck at one part. I want to read an external HTML page and then extract the data held within two <dd> tags. I've been using file_get_contents with good results, but I'm at a loss as to how to accomplish that last part. The two tags I want to extract the values from are always enclosed within a particular <div>; I was wondering if that might help.

The way I picture it, the script reads the entire HTML file into a string, discards everything up to this one particular <div>, and discards everything after the closing </div>. Is that possible? I think this needs regex, which I've never used. Any tips, links, or examples would be great! I can provide more info as necessary.

score 1 · Answer 1 · answered May 19 '10 at 21:39

Maybe this could help: http://simplehtmldom.sourceforge.net/
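
For comparison, a minimal sketch of the same DOM-style approach using PHP's built-in DOMDocument and DOMXPath rather than the linked library. The sample markup and the id "details" are assumptions made up for illustration; in practice $html would come from file_get_contents.

```php
<?php
// Hypothetical sample markup standing in for the fetched page,
// e.g. $html = file_get_contents('http://example.com/page.html');
$html = '<html><body><div id="other"><dl><dd>skip</dd></dl></div>'
      . '<div id="details"><dl><dt>Name</dt><dd>Alice</dd>'
      . '<dt>City</dt><dd>Prague</dd></dl></div></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

// Select every <dd> that sits inside the <div> with the assumed id.
$xpath = new DOMXPath($doc);
$values = [];
foreach ($xpath->query('//div[@id="details"]//dd') as $dd) {
    $values[] = trim($dd->textContent);
}

print_r($values);
```

The XPath expression is the only part that depends on the page: change the attribute test to whatever identifies the real <div>.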

therufa

score 0 · Answer 2 · answered May 19 '10 at 20:50

You are overcomplicating this. Simply load the page content and then search it with a suitable regex (preg_match()). This will do fine:

// <tag> is a placeholder for the real element; the captured text ends up in $matches['content']
preg_match('~<tag id="foobar">(?P<content>.*?)</tag>~is', $input, $matches);
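
Applied to the question's case, this might look like the sketch below: one match to isolate the <div>, then a second pass for the <dd> values. The sample input and the id "details" are assumptions for illustration. The lazy match stops at the first </div>, so this only works if the target <div> contains no nested <div>.

```php
<?php
// Hypothetical sample input; in practice $input = file_get_contents($url);
$input = '<div id="other"><dd>skip</dd></div>'
       . "<div id=\"details\">\n<dl><dt>Name</dt><dd>Alice</dd>\n"
       . '<dt>City</dt><dd>Prague</dd></dl></div>';

$values = [];

// Step 1: isolate the contents of the target <div>.
// "s" lets . match newlines, "i" makes the tag names case-insensitive.
if (preg_match('~<div id="details">(?P<content>.*?)</div>~is', $input, $m)) {
    // Step 2: collect the text of every <dd> inside that fragment.
    preg_match_all('~<dd[^>]*>(.*?)</dd>~is', $m['content'], $dd);
    $values = array_map('trim', $dd[1]);
}

print_r($values);
```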

Mikulas Dite

Yes, you could use RegEx to parse HTML, [or not](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – hemp May 19 '10 at 21:41
Everybody knows that HTML is a non-regular language. But the question was in fact: I have some text wrapped in static phrases, how do I find it? DOM parsing is much slower than a simple regex (and in PHP it is even worse than in other languages). – Mikulas Dite May 20 '10 at 06:51

score 0 · Answer 3 · answered May 21 '10 at 02:04

If you use HTQL COM to query the page, the query is: <dd>1:tx

seagulf
