Parse CSS for background images

Question

I'm tasked with a web scraping project. We are pulling a bunch of our static content into a CMS.

HtmlAgilityPack lets me grab dependent resources by looking for anything with a src or href attribute, but what about CSS files and their background images? Is there a good utility for parsing CSS files to get these?

My current solution is a bit of the Cthulhu way of doing it:

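    using System.Text.RegularExpressions;

    // [^'"")]+ stops at the first closing paren, so two url(...) on one line are not merged into one match
    Regex r = new Regex(@"url\(\s*['""]?(?<url>[^'"")]+)['""]?\s*\)", RegexOptions.IgnoreCase);
    foreach (Match item in r.Matches(cssText))
    {
        // scrub url: the named group holds the path without url(), quotes, or padding
        string url = item.Groups["url"].Value.Trim();
        // mark img for download
    }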

Not sure this is a good answer, so I'm tentatively putting it in a comment. If I was tasked with this, I'd be tempted to let the browser do the work. Rig up a bookmarklet that fires off some jQuery, traverses the page and spews image URLs into the console. Then copy/paste the console output from wandering around the site into a text file and process that further in a text editor. — izb, Jun 21 '11 at 15:29
I almost went down this path, but I wasn't sure how to start reinventing what firebug does. — Code Silverback, Jun 21 '11 at 15:35
possible duplicate of [Is there a CSS parser for C#?](http://stackoverflow.com/questions/512720/is-there-a-css-parser-for-c) — NotMe, Jun 21 '11 at 15:47
As far as I can tell, the resources in that question aren't actually much help for getting the values of properties. At least I couldn't get JsonFx's tools to do me any good. — Code Silverback, Jun 21 '11 at 15:52

score 0 · Answer 1 · edited Jun 21 '11 at 16:51

IMO it's not Cthulhu at all. Your solution sounds good enough to me, and it's probably even a good example of when to use a regexp.
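For what it's worth, here is a minimal sketch of how the regex approach might be rounded out. It is not a full CSS parser. The class name `CssUrlExtractor`, the method `Extract`, and the `cssFileUri` parameter are all made up for illustration. The sketch assumes you know the address the stylesheet was fetched from, because relative paths in CSS resolve against the stylesheet rather than the page. It also strips comments first and skips inline `data:` URIs, since those have nothing to download.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CssUrlExtractor
{
    // Removes /* ... */ comments so commented-out rules are not reported.
    private static readonly Regex Comments =
        new Regex(@"/\*.*?\*/", RegexOptions.Singleline);

    // Matches url(...) with optional quotes; the path lands in the "url" group.
    private static readonly Regex UrlToken =
        new Regex(@"url\(\s*['""]?(?<url>[^'"")]+)['""]?\s*\)", RegexOptions.IgnoreCase);

    // cssFileUri is the (assumed known) absolute address of the stylesheet itself.
    public static List<Uri> Extract(string cssText, Uri cssFileUri)
    {
        var results = new List<Uri>();
        string withoutComments = Comments.Replace(cssText, "");

        foreach (Match m in UrlToken.Matches(withoutComments))
        {
            string raw = m.Groups["url"].Value.Trim();

            // Inline data: URIs carry no separate resource to download.
            if (raw.StartsWith("data:", StringComparison.OrdinalIgnoreCase))
                continue;

            // Resolve relative paths against the stylesheet, and skip duplicates.
            Uri resolved;
            if (Uri.TryCreate(cssFileUri, raw, out resolved) && !results.Contains(resolved))
                results.Add(resolved);
        }
        return results;
    }
}
```

Called with the text of `http://example.com/css/site.css` containing `url('../img/bg.png')`, this should hand back `http://example.com/img/bg.png`. Note that `@import url(...)` and `@font-face` sources will show up too, so filter by file extension if you only want images.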

edited Jun 21 '11 at 16:51 by NotMe

answered Jun 21 '11 at 15:27 by duedl0r

Downvote reason: the concepts "regex" and "html parsing" do not belong together. A simple Google search of this site will show you the sheer number of issues surrounding this. For more information: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – NotMe Jun 21 '11 at 15:45
Thanks for pointing that out. I ask you: Do you know the difference between CSS and HTML? Did you read cthulu? – duedl0r Jun 21 '11 at 15:50
(edited so I could remove the downvote). In this extremely limited case, it *might* be okay to do this. – NotMe Jun 21 '11 at 16:52
