
I'm trying to get all CSS files of an html file from URL.

I know that getting the HTML code itself is easy - just use the PHP function file_get_contents.

The question is - can I easily look inside the HTML at a given URL and get from there the URLs or contents of all related CSS files?

Note - I want to build an engine that fetches CSS files from many websites, which is why just reading the source manually is not enough.

Thanks,

vsync
Doron Cohen
  • Are you trying to use PHP to retrieve a page, then parse the page to get a list of CSS files? If so, how does JavaScript factor into that? (you tagged your question with javascript) – Chris Baker Sep 11 '13 at 17:55
  • You'll probably need to load the response HTML into a DOM parser and start looking for `link` elements of type `text/css`, extracting the URL from them, and making new `file_get_contents` requests for each of them. Beyond that, you'll also need to parse out embedded `style` tags and inline `style` attributes throughout the HTML. – David Sep 11 '13 at 17:56
  • Josh - I updated my question. Of course reading the source is easy, but I need it for thousands of websites. – Doron Cohen Sep 11 '13 at 18:00

2 Answers


You could try using http://simplehtmldom.sourceforge.net/ for HTML parsing.

require_once 'SimpleHtmlDom/simple_html_dom.php';

$url = 'http://www.website-to-scan.com'; // include the scheme, otherwise the URL is treated as a local file path
$website = file_get_html($url);

// You might need to tweak the selector based on the website you are scanning
// Example: some websites don't set the rel attribute
// others might use less instead of css
//
// Some other options:
// link[href] - Any link with a href attribute (might get favicons and other resources but should catch all the css files)
// link[href="*.css*"] - Might miss files that aren't .css extension but return valid css (e.g.: .less, .php, etc)
// link[type="text/css"] - Might miss stylesheets without this attribute set
foreach ($website->find('link[rel="stylesheet"]') as $stylesheet)
{
    $stylesheet_url = $stylesheet->href;

    // Do something with the URL
}
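The hrefs collected this way are often relative (e.g. "/css/main.css" or "style.css"), so to actually download the stylesheet contents - which is what the engine ultimately needs - each one still has to be resolved against the page URL and fetched. A rough sketch of that step, assuming the $url and $stylesheet_url variables from the loop above; the resolve_css_url helper below is illustrative only:

// Illustrative helper: turn a possibly relative href into an absolute URL.
// Simplified sketch - it does not handle "../" segments, query strings on
// the base URL, or <base> tags.
function resolve_css_url($base, $href)
{
    if (preg_match('#^https?://#i', $href)) {
        return $href;                          // already absolute
    }
    if (substr($href, 0, 2) === '//') {
        return 'http:' . $href;                // protocol-relative
    }

    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host'];

    if (substr($href, 0, 1) === '/') {
        return $root . $href;                  // root-relative
    }

    // Document-relative: append to the directory of the page URL
    $dir = isset($parts['path']) ? rtrim(dirname($parts['path']), '/') : '';
    return $root . $dir . '/' . $href;
}

// Inside the foreach loop, in place of "// Do something with the URL":
$css_url = resolve_css_url($url, $stylesheet_url);
$css     = file_get_contents($css_url);       // raw CSS text, or false on failure

if ($css !== false) {
    // Store or process the stylesheet contents here
}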
Tom

You need to parse the HTML tags looking for CSS files. You can do it, for example, with preg_match_all - looking for all matches of a regex.

A regex which would find such files might look like this:

<link[^>]+href="([^"]+\.css[^"]*)"
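For completeness, a rough sketch of how such a pattern could be used with preg_match_all (the URL is a placeholder, and the caveats about regex-parsing HTML raised in the comments below still apply):

$html = file_get_contents('http://www.website-to-scan.com');

// Capture the href of every <link> tag whose href contains ".css"
preg_match_all('/<link[^>]+href="([^"]+\.css[^"]*)"/i', $html, $matches);

$css_urls = $matches[1];   // array of stylesheet URLs found in the page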
Kelu Thatsall
  • Using regex to parse HTML is a recipe for disaster. -1 – Chris Baker Sep 11 '13 at 18:16
  • It was a fast comment; I didn't think too much about it, and I said it's just an example of how you can do it. You are right that it's not the best idea, but for simple purposes it's just fine. IMO it's overkill to use simpleHtml if you just need to find something simple. – Kelu Thatsall Sep 11 '13 at 18:35
  • No, it is always a bad idea. HTML is not a regular language, so if you want consistent results (generally what a programmer is shooting for) then you should use the appropriate tools. I agree simpleHTML isn't necessary, since PHP has DomDocument without adding a third party library. However, I don't agree with your sentiment that using bad practices for "simple purposes" is okay. If you want reliable code, you should do it the right way every time. – Chris Baker Sep 11 '13 at 18:38
  • Well, I agree with you 100%. My answer here is then wrong, but what I meant by simple purposes is stuff like parsing only one site, whose structure you know and which you know doesn't change. For random pages... it's a bad idea, true. – Kelu Thatsall Sep 11 '13 at 18:41
  • I do not agree @Chris Baker. It is still a well-defined language, and matching a CSS include is quite simple and should be favoured over using a DOM parser. Even a DOM parser can be wrong when the HTML is not valid, and then a regex mostly performs better. Of course I would invest some time to improve the regex to be a little more fault tolerant, but I would go with that solution. – Robert Sep 01 '19 at 16:57
  • @Rob Parsing HTML is a solved problem. Your time investment in re-solving the problem using tools that are not made for the job is wasted time and money for clients. If a mature, peer reviewed, standardized DOM parser is getting it wrong, how fragile and wrong will your one-off, immature, solo developed, bootleg attempt be? This isn't really a debate subject -- research for yourself the long and well-tread writing on this subject. Rolling your own DOM parser because you don't want to use the established ones for some reason is a bad move. Can you possibly do it? Sure, I guess. Why though? – Chris Baker Sep 04 '19 at 14:58
  • @ChrisBaker I guess you got the conversation wrong. We were not talking about a DOM parser but about extracting a very easy-to-identify string. It's even more performant. – Robert Sep 04 '19 at 16:26
  • @Rob Who is the "we" here? The person asking the question accepted an answer suggesting DOM parsing as well, though they're bringing in a third party library for no reason. *You* are talking about parsing HTML with Regex, whether it's to extract significant information or just a little string, parsing HTML is a solved problem and your homebrewed regex is far, far more likely to miss an edge case than the library with thousands of hours of enhancement and edge case consideration. It's simply not the tool for the job, and you can easily verify that I am not speaking out of turn by saying so. – Chris Baker Sep 10 '19 at 12:22
  • @Rob one of the most famous Stack Overflow questions around deals with exactly this subject: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Chris Baker Sep 10 '19 at 12:23
  • There is a more serious answer on the same question that puts this whole thing to bed: https://stackoverflow.com/a/1758162/610573 – Chris Baker Sep 10 '19 at 12:26
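For reference, the DOM-based approach mentioned above (David's comment on the question, and Chris Baker's point that PHP ships with DomDocument) might look roughly like the sketch below. It is illustrative only - it assumes a scheme-qualified URL and covers external stylesheets and embedded <style> blocks, but not inline style attributes:

$url  = 'http://www.website-to-scan.com';
$html = file_get_contents($url);

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely perfectly valid
$doc->loadHTML($html);
libxml_clear_errors();

$css_urls     = array();
$style_blocks = array();

// External stylesheets: <link rel="stylesheet" href="...">
foreach ($doc->getElementsByTagName('link') as $link) {
    if (strtolower($link->getAttribute('rel')) === 'stylesheet') {
        $css_urls[] = $link->getAttribute('href');
    }
}

// Embedded styles: <style>...</style>
foreach ($doc->getElementsByTagName('style') as $style) {
    $style_blocks[] = $style->textContent;
}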