I am using Simple HTML DOM to get the HTML structure of a web page, and I also fetch all the external CSS that the page uses. Here is the code:

Class MyClass {

//... Rest of irrelevant code

private function get_web_page($url)
{
        $user_agent='Mozilla/5.0 (Windows NT 6.1; rv:8.0) Gecko/20100101 Firefox/8.0';

        $options = array(

            CURLOPT_CUSTOMREQUEST  => "POST",        //set request type post or get
            CURLOPT_POST           => true,        //set to POST
            CURLOPT_POSTFIELDS     => array(),
            CURLOPT_USERAGENT      => $user_agent, //set user agent
            CURLOPT_COOKIEFILE     => "cookie.txt", //set cookie file
            CURLOPT_COOKIEJAR      => "cookie.txt", //set cookie jar
            CURLOPT_RETURNTRANSFER => true,     // return web page
            CURLOPT_BINARYTRANSFER => true,
            CURLOPT_HEADER         => false,    // don't return headers
            CURLOPT_FOLLOWLOCATION => true,     // follow redirects
            CURLOPT_ENCODING       => "",       // handle all encodings
            CURLOPT_AUTOREFERER    => true,     // set referer on redirect
            CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
            CURLOPT_TIMEOUT        => 120,      // timeout on response
            CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
        );

        $ch      = curl_init( $url );
        curl_setopt_array( $ch, $options );
        $content = curl_exec( $ch );
        $err     = curl_errno( $ch );
        $errmsg  = curl_error( $ch );
        $header  = curl_getinfo( $ch );
        curl_close( $ch );

        $header['errno']   = $err;
        $header['errmsg']  = $errmsg;
        $header['content'] = $content;

        return $header;
}

private function collect_css($url,$html)
{
        $css = array();
        foreach($html->find('link[rel=stylesheet]') as $e){
            // Fetch the stylesheet by its href attribute, not the element itself
            $css[] = file_get_contents($e->href); //Consider all as absolute URL
        }
        return $css;
}

private function collect_inlinecss($url,$html)
{
        $css = array();
        foreach($html->find('style') as $e){
            $css[] = $e->innertext; //Get inline CSS
        }
        return $css;
}

private function filter_css($css)
{
/* What should I place here to get only certain attributes (for ex- 'display' attribute only for this case)
 * For example- if $css = #selector{ display : block; color: blue },
 * the function should return only $css = #selector{ display : block; }
 */

}

public function index(){

        $url =  "http://www.example.com";
        $raw = $this->get_web_page($url);
        $html = str_get_html($raw['content']); //Get only HTML content using Simple HTML Dom Lib
        $css = $this->collect_css($url,$html); //Get all external CSS files of webpage
        $css_inline = $this->collect_inlinecss($url,$html); //Get inline CSS (<style>....</style>)
        $css_filtered = $this->filter_css($css);

        var_dump($css_filtered); //See next for how I want it to look like
}

} // end of MyClass

The var_dump should show the stripped CSS. For the sample input below, the desired output looks like this:

Input CSS (the input to the filter function):
#id{
  display: block;
  color: blue;
  padding: 0 5px;
}

#id2{
 background: Yellow;
 margin: 0px;
 position: relative;
}

#id3{ float: left; }

Output CSS (the expected result from var_dump):
/* I wish to strip off every style except 'display' and 'position' */
#id{
  display: block;
}

#id2{
 position: relative;
}

Can anyone point me in the right direction? I know that regex could help, but I am not good at it, and I do not know of any good libraries for this. PS: For those about to say that I did not search before asking: I spent an hour going through questions like this, this, but could not find any decent solution. Please help.

Thanks

ashutosh
  • Have you tried any CSS parser? Like https://github.com/sabberworm/PHP-CSS-Parser ? – Glavić Dec 19 '13 at 15:45
  • You could try using a CSS parser - [see this question](http://stackoverflow.com/questions/236979/parsing-css-by-regex) – Pete Dec 19 '13 at 15:45

2 Answers

If you want the included CSS files, search for <link rel="stylesheet" href="source"> elements (the URL is in the href attribute); for inline styles, search for <style> elements instead.
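As a rough sketch of that lookup, the snippet below uses PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM (a swapped-in technique, assuming the dom extension is enabled). The function name find_css_sources and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: collect stylesheet URLs and inline <style> bodies.
function find_css_sources(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $sources = ['external' => [], 'inline' => []];

    // External stylesheets: the URL is the href attribute of <link rel="stylesheet">.
    foreach ($xpath->query('//link[@rel="stylesheet"][@href]') as $link) {
        $sources['external'][] = $link->getAttribute('href');
    }
    // Inline styles: the text inside each <style> element.
    foreach ($xpath->query('//style') as $style) {
        $sources['inline'][] = $style->textContent;
    }
    return $sources;
}

$page = '<html><head><link rel="stylesheet" href="main.css">'
      . '<style>#id{display:block;}</style></head><body></body></html>';
print_r(find_css_sources($page));
```

The href values may be relative, so they would still need to be resolved against the page URL before fetching.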

For the regex part, this has been a really useful website for me ;)

RegExr

L00_Cyph3r
  • I can get the CSS files without any problem. The problem arises when I need to parse the fetched CSS, which is why I asked for a regex solution. Thanks for the link, but I had already been there; since I am not very familiar with regex, I could not make proper use of it. – ashutosh Dec 19 '13 at 18:33

Many thanks to @Pete for the link he shared. The following code snippets solved my problem. You can use either of the regexes or the function.

private function parse_css($css)
{
    $css_array = array(); // master array to hold all values
    $elements = explode('}', $css);
    foreach ($elements as $element) {
        // get the name of the CSS element
        $a_name = explode('{', $element);
        $name = $a_name[0];
        // get all the key:value pair styles
        $a_styles = explode(';', $element);
        // remove element name from first property element
        $a_styles[0] = str_replace($name . '{', '', $a_styles[0]);
        // loop through each style and split apart the key from the value
        $count = count($a_styles);
        for ($a=0;$a<$count;$a++) {
            if ($a_styles[$a] != '') {
                $a_key_value = explode(':', $a_styles[$a]);
                // build the master css array
                $css_array[$name][$a_key_value[0]] = $a_key_value[1];
            }
        }               
    }
    return $css_array;
}

private function filter_css($css)
{
//$regex1 = '/([^{]+)\s*\{\s*([^}]+)\s*}/';  
//$regex2 = '/(?<selector>(?:(?:[^,{]+),?)*?)\{(?:(?<name>[^}:]+):?(?<value>[^};]+);?)*?\}/';
  $newcss = $this->parse_css($css);  //One way
//preg_match_all($regex1, $css, $newcss);  //another way
//preg_match_all($regex2, $css, $newcss);  //yet another way
  return $newcss;
}

The code above offers three ways to parse the CSS: the parse_css function and two regular expressions. Each method returns its results in a different array layout.

For example, consider the following CSS file:

#id{ display: block; color: blue; padding: 0 5px;}

#id2{ background: Yellow; margin: 0px; position: relative;}

#id3{ float: left;}

You will get the following outputs when:

  1. You use the parse_css function:
    #output:
    Array
    (
        [#id] => Array
            (
                [ display] =>  block
                [ color] =>  blue
                [ padding] =>  0 5px
            )

        [#id2] => Array
            (
                [ background] =>  Yellow
                [ margin] =>  0px
                [ position] =>  relative
            )

        [#id3] => Array
            (
                [ float] =>  left
            )

    )
  2. You use regex1:
    #output:
    Array
    (
        [0] => Array
            (
                [0] => #id{ display: block; color: blue; padding: 0 5px;}
                [1] => #id2{ background: Yellow; margin: 0px; position: relative;}
                [2] =>  #id3{ float: left;}
            )

        [1] => Array
            (
                [0] => #id
                [1] =>  #id2
                [2] => #id3
            )

        [2] => Array
            (
                [0] => display: block; color: blue; padding: 0 5px;
                [1] => background: Yellow; margin: 0px; position: relative;
                [2] => float: left;
            )

    )
  3. You use regex2:
    #output:
    Array
    (
        [0] => Array
            (
                [0] => #id{ display: block; color: blue; padding: 0 5px;}
                [1] => #id2{ background: Yellow; margin: 0px; position: relative;}
                [2] => #id3{ float: left;}
            )

        [selector] => Array
            (
                [0] => #id
                [1] => #id2
                [2] => #id3
            )

        [1] => Array
            (
                [0] => #id
                [1] => #id2
                [2] => #id3
            )

        [name] => Array
            (
                [0] =>  padding
                [1] =>  position
                [2] =>  float
            )

        [2] => Array
            (
                [0] =>  padding
                [1] =>  position
                [2] =>  float
            )

        [value] => Array
            (
                [0] =>  0 5px
                [1] =>  relative
                [2] =>  left
            )

        [3] => Array
            (
                [0] =>  0 5px
                [1] =>  relative
                [2] =>  left
            )

    )

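You can see that regex2 gives you named groups (selector, name, value) in addition to the numbered ones, which can be more convenient. Note, however, that in its output name and value hold only the last declaration of each rule, because a repeated capture group keeps only its final match. I would also personally not recommend the parse_css function as written: it has a bug that emits a notice when there is whitespace between the last ';' and the closing brace. For example, if your CSS is: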

#ie{ display:block; } /* notice the space after ';' */

it will give you the following notice: Notice: Undefined offset: 1 in path/to/your/file.php on line xx

It runs fine when there is no such whitespace. For example, when the above CSS is written as:

#ie{ display:block;} /* notice no space after ';' */

it parses without the notice.
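One possible way around that notice is to trim each piece and skip fragments that are not a complete property:value pair. The sketch below is not the original parse_css; the name parse_css_tolerant is made up, and it assumes flat rules only (no @media nesting, no ';' or '}' inside values or comments).

```php
<?php
// Sketch of a whitespace-tolerant variant of parse_css (hypothetical name).
function parse_css_tolerant(string $css): array
{
    // Assumption: comments can simply be dropped before parsing.
    $css = preg_replace('~/\*.*?\*/~s', '', $css);
    $rules = [];
    foreach (explode('}', $css) as $block) {
        $parts = explode('{', $block, 2);
        if (count($parts) !== 2) {
            continue; // leftover text after the last '}' has no '{'
        }
        $selector = trim($parts[0]);
        foreach (explode(';', $parts[1]) as $declaration) {
            // Limit of 2 keeps values that contain ':' in one piece.
            $pair = explode(':', $declaration, 2);
            if (count($pair) !== 2) {
                continue; // empty or whitespace-only fragment
            }
            $rules[$selector][trim($pair[0])] = trim($pair[1]);
        }
    }
    return $rules;
}

print_r(parse_css_tolerant("#ie{ display:block; } /* notice the space after ';' */"));
```

Because keys and values are trimmed, the resulting array has keys like display rather than the space-prefixed keys shown in the parse_css output above.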

From any of these results you can now keep only the properties you want, either with a filter scheme of your own or with a custom regex. Suggestions are welcome.
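For instance, a filter that keeps only 'display' and 'position' (as asked in the question) could look roughly like the sketch below. The function name filter_css_properties and its whitelist parameter are made up for illustration; the pattern follows the same idea as regex1 and assumes flat rules without comments or @media blocks.

```php
<?php
// Hypothetical filter: keep only whitelisted properties, drop rules left empty.
function filter_css_properties(string $css, array $allowed): string
{
    $out = [];
    // Capture the selector and the declaration body of each flat rule.
    preg_match_all('/([^{}]+)\{([^}]*)\}/', $css, $rules, PREG_SET_ORDER);
    foreach ($rules as $rule) {
        $kept = [];
        foreach (explode(';', $rule[2]) as $declaration) {
            $pair = explode(':', $declaration, 2);
            if (count($pair) === 2 && in_array(strtolower(trim($pair[0])), $allowed, true)) {
                $kept[] = '  ' . trim($pair[0]) . ': ' . trim($pair[1]) . ';';
            }
        }
        if ($kept) {
            // Rebuild the rule with only the surviving declarations.
            $out[] = trim($rule[1]) . "{\n" . implode("\n", $kept) . "\n}";
        }
    }
    return implode("\n\n", $out);
}

$css = "#id{ display: block; color: blue; padding: 0 5px; }\n"
     . "#id2{ background: Yellow; margin: 0px; position: relative; }\n"
     . "#id3{ float: left; }";
echo filter_css_properties($css, ['display', 'position']);
```

With the sample CSS from the question, this keeps display for #id and position for #id2, and drops #id3 entirely because none of its properties are whitelisted.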

ashutosh