I'm building a link-sharing site with Ruby on Rails that lets users share webpage links.
I would like to extract some representative images for each page (as on Facebook when you share a link).
For now, I use the opengraph gem to parse the og:image meta tag first, and then I use Nokogiri to parse the page content and retrieve the src attributes of all <img> tags. This gives good results (apart from some decorative images, so I filter the results by size...).
--
Now I would like to go further and parse the CSS background-image property: website logos are often displayed as the background of an <h1> or <a> tag.
I am thinking about the following process:
1. Parse the HTML document with a regex (something like /background(-image)?:.../) to find inline CSS
2. Retrieve the CSS stylesheet URLs with Nokogiri and parse these sheets with the same regex

... and absolutify the URLs relative to the URL of the document they come from.
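
To make the idea concrete, here is a minimal sketch of the regex step and the URL absolutification, using only Ruby's standard library. The method names (background_image_urls, absolutize) and the example URLs are made up for illustration, and the regexes are deliberately naive: they do not strip CSS comments and they skip data: URIs.

```ruby
require "uri"

# A `background` or `background-image` declaration; captures its value
# up to the next `;`, `{` or `}` (so `background-color` is not matched).
BACKGROUND_DECLARATION = /background(?:-image)?\s*:\s*([^;{}]+)/i

# A url(...) token inside a declaration value, with optional quotes.
URL_TOKEN = /url\(\s*['"]?([^'")]+?)['"]?\s*\)/i

# Resolve a possibly relative URL against the URL of the document
# (HTML page or stylesheet) it was found in. Returns nil if it cannot
# be parsed.
def absolutize(raw_url, base_url)
  URI.join(base_url, raw_url.strip).to_s
rescue URI::Error
  nil
end

# Works on both stylesheet text and raw HTML containing inline styles.
def background_image_urls(text, base_url)
  text.scan(BACKGROUND_DECLARATION).flatten
      .flat_map { |value| value.scan(URL_TOKEN).flatten }
      .reject { |raw| raw.start_with?("data:") }
      .map { |raw| absolutize(raw, base_url) }
      .compact
      .uniq
end
```

For a stylesheet, base_url would be the URL of the stylesheet itself, since relative URLs in CSS are resolved against the sheet and not against the page that links to it. For inline styles, it would be the URL of the page.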
--
My questions are:
Do you think there is a better alternative?
Is there a library of some sort that can improve the performance of this process?
For example, if I could build a consolidated view of HTML+CSS that lets me access CSS properties via the DOM, I could look only at the background images of pre-selected HTML elements (h1, a, ...) and limit the number of results.
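
Short of a real consolidated view, the closest thing I can think of is a rough approximation: keep only the CSS rules whose selector ends in one of the pre-selected tag names. The sketch below is only a heuristic and all names in it (backgrounds_for_tags, tag_of) are hypothetical. It does not touch the DOM at all, so it misses rules such as #logo or .brand that happen to apply to an h1 or an a, and it does not handle attribute selectors or comments.

```ruby
# One CSS rule: a selector list followed by a { ... } body.
CSS_RULE = /([^{}]+)\{([^{}]*)\}/

# First url(...) found in a background or background-image declaration.
BACKGROUND_URL = /background(?:-image)?\s*:[^;{}]*?url\(\s*['"]?([^'")]+?)['"]?\s*\)/i

# Tag name of the element a selector finally targets, e.g. "a" for
# "#header a.brand:hover"; nil when the last part has no tag name.
def tag_of(selector)
  last = selector.strip.split(/[\s>+~]+/).last.to_s
  tag = last[/\A[a-z][a-z0-9]*/i]
  tag && tag.downcase
end

# Returns { selector => raw background URL } for the rules that target
# one of the given tag names.
def backgrounds_for_tags(css_text, tags)
  css_text.scan(CSS_RULE).each_with_object({}) do |(selectors, body), found|
    url = body[BACKGROUND_URL, 1]
    next unless url

    selectors.split(",").map(&:strip).each do |selector|
      found[selector] = url if tags.include?(tag_of(selector))
    end
  end
end
```

Called as backgrounds_for_tags(css, %w[h1 a]), this keeps a rule like "#header a.brand:hover" and drops "div.hero". The URLs it returns are still raw and would need to be resolved against the stylesheet URL.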