Getting the content inside the HTML body tags is not working

Question

I am trying to extract the HTML content of a web page. I want only the content inside the <body> tags.

    // $validlink is a link with an .htm extension; the source is rather large,
    // about 24,000 lines of HTML

    $thehtml = file_get_contents($validlink);
    $thehtml = preg_match("/<body.*?>(.*?)<\/body>/is", $thehtml);

What else can I do? $thehtml is empty. I am trying to insert this into a WordPress post, but $thehtml is empty for some odd reason. Is there possibly a timeout issue or something?

It can't be a timeout issue: I noticed that if I output just file_get_contents($validlink); then for some reason BODY is not found.

Another possible solution would be to get just the content between the first div and the last div found in the document.

Use a DOM parser, not regexp, to extract information from HTML. — Barmar, Apr 23 '13 at 05:46

deepi · Accepted Answer · 2013-04-24T05:28:01.723

0

Get the string positions of the opening and closing tags with strpos(), then pass those positions to substr() to extract the content between them.
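A minimal sketch of that approach, assuming the page contains one `<body ...>` opening tag and a closing `</body>`, and that no attribute value in the opening tag contains a `>`. The helper name `extractBodyInner` is hypothetical, and `stripos()`/`strripos()` are swapped in for `strpos()` so that an upper-case `<BODY>` is also found:

```php
<?php
// Hypothetical helper: returns the markup between <body ...> and </body>,
// or null when either tag is missing.
function extractBodyInner(string $html): ?string
{
    // Case-insensitive search for the start of the opening tag.
    $open = stripos($html, '<body');
    if ($open === false) {
        return null;
    }
    // The opening tag may carry attributes, so find the '>' that closes it.
    $start = strpos($html, '>', $open);
    if ($start === false) {
        return null;
    }
    $start++; // first character after the opening tag

    // Last closing tag in the document, matched case-insensitively.
    $end = strripos($html, '</body>');
    if ($end === false || $end < $start) {
        return null;
    }
    return substr($html, $start, $end - $start);
}

$html = '<html><BODY class="x"><p>Hello</p></BODY></html>';
echo extractBodyInner($html); // <p>Hello</p>
```

Comparing against `false` with `===` matters here, because a tag found at position 0 would otherwise be treated as "not found".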

edited Apr 24 '13 at 05:28

answered Apr 23 '13 at 05:26

deepi


Thanks, I was able to make a workaround using substr(), strpos(), etc. – John Apr 24 '13 at 04:45

score 0 · Answer 2 · answered Apr 23 '13 at 05:47


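    $thehtml = file_get_contents($validlink);
    // preg_match() returns 1, 0 or false, not the matched text; the captures go into $matches.
    if (preg_match("/<body.*?>(.*?)<\/body>/is", $thehtml, $matches)) {
        // $matches[0] is the whole match including the body tags; $matches[1] is only the inner content.
        $thehtml = $matches[0];
    }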

answered Apr 23 '13 at 05:47

Amir


score 0 · Answer 3 · edited May 23 '17 at 12:12


Here is the correct code:

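    $thehtml = file_get_contents($validlink);
    // Group 1 captures everything between the opening <body ...> tag and </body>.
    // preg_match() returns 0 on no match and false on error, so check it first.
    if (preg_match('/<body.*?>(.*?)<\/body>/is', $thehtml, $matches)) {
        $thehtml = $matches[1];
    }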

But I suggest you use a DOM parser instead.

edited May 23 '17 at 12:12

Community


answered Apr 23 '13 at 05:53

Randle392


How would you do it with a DOM parser? Something like $thehtml = file_get_contents($validlink); $dumphtml = $thehtml->find('body')->innertext; ? – John Apr 24 '13 at 03:33
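One possible way to do this, sketched with PHP's built-in DOMDocument rather than the `find()`/`innertext` style shown in the comment above (that syntax looks like a third-party parser library and would not work on the plain string returned by `file_get_contents()`). The helper name `bodyInnerHtml` is hypothetical, and the sketch assumes the DOM extension is enabled:

```php
<?php
// Hypothetical helper: parses $html with DOMDocument and returns the
// serialized children of <body>, or null when there is nothing to parse.
function bodyInnerHtml(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects empty input
    }
    $doc = new DOMDocument();
    // Real-world pages are rarely valid; collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return null;
    }
    // DOMElement does not expose an innerHTML property, so serialize each child node.
    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

// In the question's code the input would come from file_get_contents($validlink).
echo bodyInnerHtml('<html><body class="x"><p>Hello</p></body></html>');
```

Since `file_get_contents()` returns `false` on failure, it is worth checking its result before passing it to a helper like this one.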
