
I'm tasked with a web scraping project. We are pulling a bunch of our static content into a CMS.

HtmlAgilityPack lets me grab dependent resources by looking for anything with a src or href attribute, but what about CSS files and their background images? Is there a good utility for parsing CSS files to get these?

My current solution is a bit of the Cthulhu way of doing this:

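    // Lazy match, so two url(...) values on one line are not merged into one.
    Regex r = new Regex(@"url\((.*?)\)");
    foreach (Match item in r.Matches(cssText))
    {
        // Scrub the URL: strip whitespace and optional quotes.
        string url = item.Groups[1].Value.Trim().Trim('\'', '"');
        // Mark the image for download.
    }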
Code Silverback
  • Not sure this is a good answer, so I'm tentatively putting it in a comment. If I was tasked with this, I'd be tempted to let the browser do the work. Rig up a bookmarklet that fires off some jQuery, traverses the page and spews image URLs into the console. Then copy/paste the console output from wandering around the site into a text file and process that further in a text editor. – izb Jun 21 '11 at 15:29
  • I almost went down this path, but I wasn't sure how to start reinventing what Firebug does. – Code Silverback Jun 21 '11 at 15:35
  • possible duplicate of [Is there a CSS parser for C#?](http://stackoverflow.com/questions/512720/is-there-a-css-parser-for-c) – NotMe Jun 21 '11 at 15:47
  • As far as I can tell, the resources in that question aren't actually much help for getting the values of properties. At least I couldn't get JsonFx's tools to do me any good. – Code Silverback Jun 21 '11 at 15:52

1 Answer

IMO it's not Cthulhu at all. Your solution sounds good enough to me, and it's probably even a good example of when to use a regex.
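For what it's worth, here is a minimal sketch of how the regex approach could be tightened up. The class name `CssUrlExtractor` and the method `ExtractUrls` are made up for illustration (they are not part of HtmlAgilityPack), and it assumes you know the URL the stylesheet was downloaded from, since relative `url(...)` values resolve against the stylesheet rather than the page. It does not try to handle escaped characters or comments inside the CSS.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library.
public static class CssUrlExtractor
{
    // Matches url(...) with optional matching single or double quotes.
    // The negated character class stops at the first quote or closing paren,
    // so several url(...) values on one line are matched separately.
    private static readonly Regex UrlPattern = new Regex(
        @"url\(\s*(?<quote>['""]?)(?<url>[^'"")]+)\k<quote>\s*\)",
        RegexOptions.IgnoreCase);

    public static List<Uri> ExtractUrls(string cssText, Uri stylesheetUri)
    {
        var results = new List<Uri>();
        foreach (Match match in UrlPattern.Matches(cssText))
        {
            string raw = match.Groups["url"].Value.Trim();

            // Skip inline data: URIs; there is nothing to download.
            if (raw.StartsWith("data:", StringComparison.OrdinalIgnoreCase))
                continue;

            // Resolve relative URLs against the stylesheet's own location
            // and drop duplicates.
            Uri absolute;
            if (Uri.TryCreate(stylesheetUri, raw, out absolute) && !results.Contains(absolute))
                results.Add(absolute);
        }
        return results;
    }
}
```

Calling `CssUrlExtractor.ExtractUrls(cssText, new Uri("http://example.com/css/site.css"))` on a stylesheet containing `url('../img/bg.png')` should give back `http://example.com/img/bg.png`, which you can then mark for download.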

NotMe
duedl0r
  • Downvote reason: the concepts "regex" and "html parsing" do not belong together. A simple Google search of this site will show you the sheer number of issues surrounding this. For more information: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – NotMe Jun 21 '11 at 15:45
  • Thanks for pointing that out. I ask you: Do you know the difference between CSS and HTML? Did you read the Cthulhu answer? – duedl0r Jun 21 '11 at 15:50
  • (edited so I could remove the downvote). In this extremely limited case, it *might* be okay to do this. – NotMe Jun 21 '11 at 16:52