I am trying to extract the HTML content of a web page. I want only the content inside the <body> tags.

    // $validlink is a link with an .htm extension; the source is rather large:
    // it contains 24,000 lines of HTML

    $thehtml = file_get_contents($validlink);
    $thehtml = preg_match("/<body.*?>(.*?)<\/body>/is", $thehtml);

What else can I do? $thehtml is empty for some reason, and I am trying to insert its content into a WordPress post. Could this be a timeout issue?

I don't think it can be a timeout issue, because I noticed that if I output just the result of file_get_contents($validlink), for some reason BODY is not found.

Another possible solution would be to get the content between the first div and the last div found in the document.

John

3 Answers

Get the positions of the opening and closing tags with strpos(), then extract the text between those positions with substr().
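
As a rough sketch of that approach (not necessarily what the answerer had in mind): the helper name extractBody() is made up for illustration, and it uses the case-insensitive stripos()/strripos() instead of strpos() so that an uppercase <BODY> is found too. It assumes no attribute value in the opening tag contains a > character, and the nullable return type needs PHP 7.1 or later.

```php
<?php
// Hypothetical helper: returns the markup between <body ...> and </body>,
// or null when either tag cannot be located.
function extractBody(string $html): ?string
{
    // Find the start of the opening tag, ignoring case.
    $open = stripos($html, '<body');
    if ($open === false) {
        return null;
    }

    // Find the '>' that ends the opening tag, so attributes are skipped.
    $start = strpos($html, '>', $open);
    if ($start === false) {
        return null;
    }
    $start++; // first character after the opening tag

    // Find the last closing tag, ignoring case.
    $end = strripos($html, '</body>');
    if ($end === false || $end < $start) {
        return null;
    }

    return substr($html, $start, $end - $start);
}

// Made-up sample input, not the asker's page.
$sample = '<html><head><title>t</title></head><BODY class="x"><p>Hello</p></body></html>';
echo extractBody($sample), "\n"; // <p>Hello</p>
```

With the asker's variables this would be called as $thehtml = extractBody(file_get_contents($validlink)); followed by a check for null before using the result.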

deepi
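    $thehtml = file_get_contents($validlink);
    // preg_match() returns 1, 0 or false; the matched text goes into $matches
    preg_match("/<body.*?>(.*?)<\/body>/is", $thehtml, $matches);
    // $matches[0] is the whole match, <body> tags included; $matches[1] is only the content between them
    $thehtml = $matches[0];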
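
For reference, preg_match() returns 1 if the pattern matched, 0 if it did not, and false on error; the matched text only ends up in the array passed as the third argument. That is why assigning the return value to $thehtml loses the HTML. A small sketch with a made-up sample string (the markup below is an assumption, not the asker's page):

```php
<?php
// Made-up sample input for illustration.
$html = '<html><head></head><body class="x"><p>Hello</p></body></html>';

// The return value is a status, not the matched text.
$result = preg_match('/<body.*?>(.*?)<\/body>/is', $html, $matches);

if ($result === 1) {
    echo $matches[0], "\n"; // whole match: <body class="x"><p>Hello</p></body>
    echo $matches[1], "\n"; // first capture group: <p>Hello</p>
} elseif ($result === false) {
    // An error occurred inside the regex engine; preg_last_error() gives the code.
    echo 'PCRE error code: ', preg_last_error(), "\n";
} else {
    echo "No <body> element matched\n";
}
```

On a very large page it may be worth checking for a false return, since preg_last_error() then reports what went wrong (for example PREG_BACKTRACK_LIMIT_ERROR).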
Amir

Here is the correct code:

    $thehtml = file_get_contents($validlink);
    preg_match('/<body.*?>(.*?)<\/body>/is', $thehtml, $matches);
    $thehtml = $matches[1];

However, I suggest using a DOM parser instead.
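
A minimal sketch of what that could look like, assuming PHP's built-in DOMDocument class (the answer does not name a specific parser, so this choice is an assumption, and the helper name bodyInnerHtml() is made up for illustration). It needs the DOM extension, which is normally enabled by default, and PHP 7.1 or later for the nullable return type.

```php
<?php
// Hypothetical helper: returns the inner HTML of the <body> element,
// or null if the input is empty or no <body> is found.
function bodyInnerHtml(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();

    // Real pages are rarely valid HTML; collect parser warnings
    // internally instead of printing them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return null;
    }

    // Serialize each child of <body> so the <body> tags themselves are left out.
    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

// Made-up sample input, not the asker's page.
$sample = '<html><head><title>t</title></head><body class="x"><p>Hello</p></body></html>';
echo bodyInnerHtml($sample), "\n";
```

With the asker's variables this would be called as $thehtml = bodyInnerHtml(file_get_contents($validlink));. Unlike the regex, the parser copes with attributes on the tag, uppercase tag names and unbalanced markup, although it may normalize the HTML slightly when serializing it.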

Randle392
  • How would you do it with a DOM parser? Like this: $thehtml = file_get_contents($validlink); $dumphtml = $thehtml->find('body')->innertext; ? – John Apr 24 '13 at 03:33