1

I am using PHP and want to fetch the content of a URL as quickly as possible.
Here is the code I use.
Code (1):

<?php
    $content = file_get_contents('http://www.filehippo.com');
    echo $content;
?>

There are many other ways to read files, such as fopen() and readfile(), but I think file_get_contents() is faster than these.

When you run the code above, it returns everything from the website, including images and ads. I want only the plain HTML text, with no CSS styles, images, or ads. How can I get this?
See this to understand what I mean.
Code (2):

<?php
    $content = file_get_contents('http://www.filehippo.com');
    // do something to remove css-style, images and ads.
    // return the plain html text in $mod_content.
    echo $mod_content;
?>

If I do it as above, I am going about it the wrong way, because I have already downloaded the full content into $content and only then modify it.
Is there any function, method, or anything else that gets the plain HTML text directly from the URL?

The code below is only for illustration; it is not real PHP code.
Ideal code (3):

<?php
    $plain_content = get_plain_html('http://www.filehippo.com');
    echo $plain_content; // no css-style, images and ads.
?>

If such a function exists, it would be much faster than the alternatives. Is this possible?
Thanks.

Kailas
Axeem
  • The page `http://www.filehippo.com` has scripts and styles embedded within it already. You can't choose not to download it but you could filter it. – Dave Chen May 27 '13 at 05:02
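
The filtering mentioned in the comment above could look roughly like the following. This is a minimal sketch, assuming the DOM extension is available; `strip_page_assets()` is a hypothetical helper name, not a built-in PHP function. It still has to download the whole page first, and it also removes every `<link>` element, not only stylesheets.

```php
<?php
// Hypothetical helper: strip_page_assets() is an illustrative name, not part of PHP.
function strip_page_assets(string $html): string
{
    $doc = new DOMDocument();

    // Real-world pages are rarely valid HTML; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach (['script', 'style', 'img', 'link'] as $tag) {
        // getElementsByTagName() returns a live list, so copy it before removing nodes.
        $nodes = iterator_to_array($doc->getElementsByTagName($tag));
        foreach ($nodes as $node) {
            $node->parentNode->removeChild($node);
        }
    }

    return $doc->saveHTML();
}
```

With a real page, the input would be the string returned by `file_get_contents()`; inline `style` attributes are left in place by this sketch.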

2 Answers

4

Try this.

// Example wrapper class so that $this is valid; the name HtmlSimplifier is illustrative.
class HtmlSimplifier
{
    public $html;

    public function __construct($url)
    {
        $this->html = file_get_contents($url);
        $this->process();
    }

    public function process()
    {
        // header
        $this->_replace('/.*<head>/ism', "<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE html PUBLIC '-//WAPFORUM//DTD XHTML Mobile 1.0//EN' 'http://www.wapforum.org/DTD/xhtml-mobile10.dtd'><html xmlns='http://www.w3.org/1999/xhtml'><head>");

        // title
        $this->_replace('/<head>.*?(<title>.*<\/title>).*?<\/head>/ism', '<head>$1</head>');

        // strip out divs with little content
        // _stripContentlessDivs() is part of phpmobilizer (linked below) and is not reproduced here.
        // $this->_stripContentlessDivs();

        // divs/p
        $this->_replace('/<div[^>]*>/ism', '');
        $this->_replace('/<\/div>/ism', '<br/><br/>');
        $this->_replace('/<p[^>]*>/ism', '');
        $this->_replace('/<\/p>/ism', '<br/>');

        // h tags
        $this->_replace('/<h[1-5][^>]*>(.*?)<\/h[1-5]>/ism', '<br/><b>$1</b><br/><br/>');

        // remove align/height/width/style/rel/id/class attributes
        $this->_replace('/\salign=(\'?\"?).*?\\1/ism', '');
        $this->_replace('/\sheight=(\'?\"?).*?\\1/ism', '');
        $this->_replace('/\swidth=(\'?\"?).*?\\1/ism', '');
        $this->_replace('/\sstyle=(\'?\"?).*?\\1/ism', '');
        $this->_replace('/\srel=(\'?\"?).*?\\1/ism', '');
        $this->_replace('/\sid=(\'?\"?).*?\\1/ism', '');
        $this->_replace('/\sclass=(\'?\"?).*?\\1/ism', '');

        // remove comments
        $this->_replace('/<\!--.*?-->/ism', '');

        // remove script/style
        $this->_replace('/<script[^>]*>.*?<\/script>/ism', '');
        $this->_replace('/<style[^>]*>.*?<\/style>/ism', '');

        // multiple \n
        $this->_replace('/\n{2,}/ism', '');

        // remove multiple <br/>
        $this->_replace('/(<br\s?\/?>){2}/ism', '<br/>');
        $this->_replace('/(<br\s?\/?>\s*){3,}/ism', '<br/><br/>');

        // tables
        $this->_replace('/<table[^>]*>/ism', '');
        $this->_replace('/<\/table>/ism', '<br/>');
        $this->_replace('/<(tr|td|th)[^>]*>/ism', '');
        $this->_replace('/<\/(tr|td|th)[^>]*>/ism', '<br/>');

        // wrap and close
    }

    // Apply one regex replacement to the stored HTML.
    private function _replace($pattern, $replacement, $limit = -1)
    {
        $this->html = preg_replace($pattern, $replacement, $this->html, $limit);
    }
}

$page = new HtmlSimplifier('http://www.filehippo.com');
echo $page->html;

For more, see https://code.google.com/p/phpmobilizer/

  • 2
    No need to use $this, when it is simple code snippet can be used outside class. Or atleast convert it to example class so unexperienced copy-paste will not make errors. – kuldeep.kamboj May 27 '13 at 06:37
0

You can use a regular expression to delete the CSS, script, and image tags; just replace the matches with an empty string:

preg_replace($pattern, $replacement, $string);

For more details on the function, see: http://php.net/manual/en/function.preg-replace.php
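
As a rough illustration of the `preg_replace()` call above, here is a minimal sketch. The helper name `strip_tags_with_regex()` and the three patterns are assumptions chosen for this example; they are not exhaustive, and regular expressions can miss nested or malformed markup.

```php
<?php
// Hypothetical helper: the name and the patterns are illustrative only.
function strip_tags_with_regex(string $html): string
{
    $patterns = [
        '/<script\b[^>]*>.*?<\/script>/is', // script blocks, including their contents
        '/<style\b[^>]*>.*?<\/style>/is',   // style blocks, including their contents
        '/<img\b[^>]*>/i',                  // image tags
    ];

    // preg_replace() returns null on error; fall back to the unmodified input in that case.
    return preg_replace($patterns, '', $html) ?? $html;
}
```

The page still has to be fetched in full (for example with `file_get_contents()`) before the patterns are applied.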

jad-panda
  • 2,509
  • 16
  • 22
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Dave Chen May 27 '13 at 05:02
  • **jaD**, you are suggesting the same approach as **code (2)**. Please see my question; it explains why this is not good. Thanks. – Axeem May 27 '13 at 05:09
  • @user2280065, from http://www.filehippo.com you can't choose what to get and what not to get. Whenever you request the http://www.filehippo.com page, the server sends the whole page every time. What you can do is something like caching: save the most frequently used pages. – jad-panda May 27 '13 at 05:22
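
The caching idea from the last comment could be sketched as follows. This is only an assumption about what such a cache might look like: `fetch_cached()`, its parameters, and the file naming scheme are all hypothetical, and the optional `$fetch` callable exists only so the download step can be swapped out.

```php
<?php
// Hypothetical helper: fetch_cached() and its parameters are illustrative, not a PHP built-in.
function fetch_cached(string $url, string $cacheDir, int $ttl = 3600, ?callable $fetch = null): string
{
    $fetch = $fetch ?? 'file_get_contents';
    $file = rtrim($cacheDir, '/') . '/' . md5($url) . '.html';

    // Serve the saved copy while it is younger than $ttl seconds.
    clearstatcache(true, $file);
    if (is_file($file) && time() - filemtime($file) < $ttl) {
        return file_get_contents($file);
    }

    // Otherwise download the page and save it for later requests.
    $content = $fetch($url);
    if ($content === false) {
        throw new RuntimeException("Could not fetch $url");
    }
    file_put_contents($file, $content);

    return $content;
}
```

Only the first request within the time-to-live pays for the download; later calls read the local file. The cache directory must already exist and be writable.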