
I'm trying to save some web pages to text files using PHP scripts.

How can I load a web page into a file buffer with PHP and remove HTML tags?

simhumileco

3 Answers


One way:

$url = "http://www.brothersoft.com/publisher/xtracomponents.html";
$page = file_get_contents($url);
$outfile = "xtracomponents.html";
file_put_contents($outfile, $page);

The code above is just an example and lacks any(!) error checking and handling.
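
A minimal sketch of the same idea with basic error checking and tag removal added. It assumes PHP 7 or later; `savePageAsText` is a hypothetical helper name, and the script/style regex is a rough heuristic rather than a full HTML parser:

```php
<?php
// Hypothetical helper: fetch a page, strip its tags, and save the text.
// Returns true on success, false if the page could not be read or written.
function savePageAsText(string $url, string $outfile): bool
{
    // file_get_contents() returns false on failure; @ hides the warning.
    $page = @file_get_contents($url);
    if ($page === false) {
        return false;
    }

    // strip_tags() keeps the contents of <script> and <style> elements,
    // so drop those blocks first (heuristic, assumes well-formed pairs).
    $page = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $page) ?? $page;

    // Remove the remaining tags, then turn entities such as &amp; into characters.
    $text = html_entity_decode(strip_tags($page), ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // file_put_contents() returns the byte count, or false on failure.
    return file_put_contents($outfile, trim($text)) !== false;
}
```

Calling `savePageAsText($url, 'xtracomponents.txt')` with the URL above would write the text version; a `false` return means the page could not be fetched or the file could not be written.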

Shi
ghostdog74

As the other answers have said, either the standard PHP stream functions or cURL is your best bet for retrieving the HTML. As for removing the tags, here are a couple of approaches:

Option #1: Use the Tidy extension, if available on your server, to walk through the document tree recursively and return the text from the nodes. Something like this:

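// Recursively collect the text content of a Tidy node and its descendants.
function textFromHtml(tidyNode $node) {
    if ($node->isText()) {
        return $node->value;
    }
    if ($node->hasChildren()) {
        $childText = '';
        foreach ($node->child as $child) {
            $childText .= textFromHtml($child);
        }
        return $childText;
    }
    // Childless non-text nodes (e.g. <br />, <img />) contribute no text.
    return '';
}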

You might want something more sophisticated than that, e.g., that replaces <br /> tags (where $node->name == 'br') with newlines, but this will do for a start.

Then, load the text of the HTML into a Tidy object and call your function on the body node. If you have the contents in a string, use:

$tidy = new tidy();
$tidy->parseString($contents);
$text = textFromHtml($tidy->body());
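
As a rough sketch of the "more sophisticated" variant mentioned above, the function below emits a newline for each `br` node. It assumes the Tidy extension is loaded; `textFromHtmlWithBreaks` is a hypothetical name, and the sample markup is made up for illustration:

```php
<?php
// Sketch: like textFromHtml(), but emits "\n" for <br> elements.
// textFromHtmlWithBreaks is a hypothetical name, not part of Tidy.
function textFromHtmlWithBreaks(tidyNode $node): string
{
    if ($node->isText()) {
        return $node->value;
    }
    if ($node->name === 'br') {
        return "\n";
    }
    $text = '';
    if ($node->hasChildren()) {
        foreach ($node->child as $child) {
            $text .= textFromHtmlWithBreaks($child);
        }
    }
    return $text;
}

// Usage, guarded so the script still runs where Tidy is not installed.
if (class_exists('tidy')) {
    $tidy = new tidy();
    $tidy->parseString('<p>Line one<br />Line two</p>', [], 'utf8');
    $body = $tidy->body();
    $text = $body === null ? '' : textFromHtmlWithBreaks($body);
    echo $text, PHP_EOL;
}
```

Block-level elements such as paragraphs and list items could be handled the same way by checking `$node->name` for them as well.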

Option #2: Use regexes to strip everything between < and >. You could (and probably should) develop a more sophisticated regex that, for example, matches only valid HTML start or end tags. Any errors in the syntax of the page, like a stray angle bracket in body text, could mean garbage output if you aren't careful. This is why Tidy is so nice (it is specifically designed to clean up bad pages), but it might not be available.
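
To illustrate the difference, here is a small sketch comparing a naive pattern with a slightly stricter one. Both function names and the sample string are made up for this example, and the stricter pattern is still only a heuristic (it does not handle comments or `>` inside attribute values):

```php
<?php
// Naive: delete everything between "<" and the next ">".
function stripTagsNaive(string $html): string
{
    return preg_replace('/<[^>]*>/', '', $html) ?? $html;
}

// Stricter: only delete spans that look like start or end tags,
// i.e. "<" or "</" followed immediately by a letter.
function stripTagsStrict(string $html): string
{
    return preg_replace('~</?[a-zA-Z][^<>]*>~', '', $html) ?? $html;
}

$sample = '<p>1 < 2 and 3 > 2</p>';
echo stripTagsNaive($sample), PHP_EOL;   // "1  2": the comparison is lost
echo stripTagsStrict($sample), PHP_EOL;  // "1 < 2 and 3 > 2"
```

The naive version treats the stray angle brackets in the body text as a tag and swallows the text between them, which is the kind of garbage output described above.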

Tim Yates

I strongly recommend taking a look at the SimpleHTML DOM class:

SimpleHTML DOM Parser at SourceForge

With it you can search the DOM tree using CSS selectors, as with jQuery's $() function or Prototype's $$() function.

By default it uses file_get_contents() to fetch the content of a web page, but you can also pass it an HTML string that you retrieved yourself, for example with a cURL class of your own (if you need to log in, etc.).
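
A rough sketch of that cURL route, assuming the cURL extension is installed. `fetchHtml` and its cookie-file parameter are hypothetical, and the login request itself is not shown; the returned string is what you would hand to the parser:

```php
<?php
// Hypothetical helper: fetch a page with cURL and return its HTML,
// or null if the request fails. $cookieFile optionally keeps session
// cookies between requests (useful after a login request).
function fetchHtml(string $url, ?string $cookieFile = null): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_FAILONERROR    => true,  // treat HTTP status >= 400 as failure
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => 15,
    ]);
    if ($cookieFile !== null) {
        curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // read cookies from here
        curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // and write them back
    }
    $html = curl_exec($ch);
    return is_string($html) ? $html : null;
}

// With the library loaded, the string can then be parsed, e.g.:
// $dom = str_get_html($html);   // str_get_html() comes from simple_html_dom.php
// echo $dom->plaintext;
```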

Kemo