
I'm building a site with Ruby on Rails that lets users share links to web pages.

I would like to extract some representative images for each page (as on Facebook when you share a link).

For now I use the opengraph gem to parse the og:image meta tag first, and then I use Nokogiri to parse the page content and retrieve the src attributes of all <img> tags. This gives good results (except for some decorative images, so I filter the results by size...).

--

Now I would like to go further and parse the CSS background-image property: website logos are often displayed as the background of an <h1> or <a> tag.

I'm thinking of the following process:

  • Parse HTML document with regex (something like /background(-image)?:.../) to find inline CSS

  • Retrieve CSS stylesheets URLs with Nokogiri and parse these sheets with the same regex

... and make the extracted URLs absolute, relative to the URL of the document they came from.
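The extraction and URL-resolution steps of this process could be sketched as follows. This is only a minimal illustration, not a robust CSS parser: the constant `BACKGROUND_URL` and the method `background_image_urls` are hypothetical names, the regex is an assumption about what the declarations look like, and it uses only Ruby's standard `uri` library. Fetching the stylesheets themselves (for example from the `<link>` tags found with Nokogiri) is left out.

```ruby
require "uri"

# Naive pattern: a background or background-image declaration that contains
# url(...). Captures the URL without its optional quotes.
BACKGROUND_URL = /background(?:-image)?\s*:[^;{}]*?url\(\s*["']?([^"')]+?)["']?\s*\)/i

# css_text: inline style text or the body of a stylesheet.
# base_url: URL of the document the CSS came from (the page for inline CSS,
#           the stylesheet itself for external CSS).
def background_image_urls(css_text, base_url)
  urls = css_text.scan(BACKGROUND_URL).flatten.map do |raw|
    next if raw.start_with?("data:") # skip embedded data URIs
    begin
      URI.join(base_url, raw.strip).to_s
    rescue URI::Error
      nil # ignore URLs that cannot be parsed
    end
  end
  urls.compact.uniq
end
```

For an external stylesheet, `base_url` should be the stylesheet's own URL, since relative paths in CSS are resolved against the sheet and not against the page that links to it.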

--

My questions are:

  • Do you think there is a better alternative?

  • Is there a library of some sort that can increase the performance of the process?

    For example, if I could build a consolidated view of HTML+CSS, which allows me to access CSS properties via the DOM, I could access only the background-images of pre-selected HTML elements (h1,a,...) and limit the number of results.

Thomas Guillory
  • Actually the image which is shown is defined in the `meta` tag with the property `og:image` and `img` tags are just used if it isn't defined. – noob Apr 19 '12 at 13:04
  • @micha Yes I know, I parse it too. But the great majority of websites are not tagged with OpenGraph. – Thomas Guillory Apr 19 '12 at 13:17
  • Not too familiar with Ruby on Rails, but I would avoid using regex to parse HTML or CSS. http://stackoverflow.com/a/1732454/522877 – Wex Apr 19 '12 at 20:04
  • @Wex Hm, yes, I agree, but according to [this answer](http://stackoverflow.com/a/1733489/1089771) parsing a well-known subset of HTML (not an arbitrary HTML document) is a good job for regex. In my case I just want to match /background: url(...)/, not nested tags. – Thomas Guillory Apr 19 '12 at 21:15

1 Answer

When you parse the CSS of a web site, any images you are going to get back are going to be related to the user interface (sprites, backgrounds), not the actual content of the page.

I don't think it would be worth your while unless you're just trying to extract logos. In that case I would restrict to matches on class names/ids/paths containing the word "logo".
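As a rough illustration of that restriction, the sketch below keeps only background URLs whose selector or path mentions "logo". The names `RULE`, `URL_IN_BODY` and `logo_candidates` are hypothetical, and the rule-splitting regex is deliberately naive (it is an assumption that the stylesheet has simple `selector { declarations }` rules), so treat it as a starting point and not as a real CSS parser.

```ruby
# Naive split of a stylesheet into [selector, declarations] pairs.
RULE = /([^{}]+)\{([^{}]*)\}/
# Any url(...) inside a declaration block, without its optional quotes.
URL_IN_BODY = /url\(\s*["']?([^"')]+?)["']?\s*\)/i

# Returns the URLs whose selector (class name, id) or path contains "logo".
def logo_candidates(css_text)
  matches = css_text.scan(RULE).flat_map do |selector, body|
    body.scan(URL_IN_BODY).flatten.select do |url|
      selector =~ /logo/i || url =~ /logo/i
    end
  end
  matches.uniq
end
```

The returned URLs are still as written in the CSS, so they would need to be resolved against the stylesheet URL before being fetched.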

If you want to extract "representative images" from a page, I would just parse the image tags as you are doing, and then generate (and crop) a screenshot of the page as described in: How do I take screenshots of web pages using ruby and a unix server?

How are you handling images that aren't in the raw HTML source?

In terms of libraries, I'm pretty sure nokogiri is the best thing out there.

Paul McClean
  • Thanks for the answer, and for the link. It points me to Selenium, which could be interesting. >> How are you handling images that aren't in the raw HTML source? I don't; I only parse the raw HTML with Nokogiri. What do you have in mind? Images loaded into the DOM with JS? – Thomas Guillory Apr 20 '12 at 15:42
  • Yeah, I was thinking about sites that load their image assets via JavaScript (lazy loading, for example) or slideshows. – Paul McClean Apr 20 '12 at 16:39