
I'm trying to scrape web pages.

I want to download a web page by providing its URL and save it for offline reading with all its images. I can't manage to do that with wget since it creates many directories.

Is this possible with wget? Is there something like the "Save as" option in Firefox, which creates a directory and puts all the required resources into it along with the HTML page?

Would it be possible to do this with Nokogiri or Mechanize?

  • This SO thread might get you started: http://stackoverflow.com/questions/4217223/how-to-get-the-html-source-of-a-webpage-in-ruby – orde May 09 '13 at 19:28
  • Thanks, but it doesn't say how to download the pictures. I'd like to download the page with its pictures so I can read it even without an internet connection. – don ali May 09 '13 at 19:51
  • Another thread: http://stackoverflow.com/questions/1074309/how-to-download-a-picture-using-ruby – orde May 09 '13 at 20:13
  • Of course it's possible using Nokogiri and a couple of other gems, like OpenURI or Net::HTTP, but you'll have to write the code telling them what to do because they don't do it by themselves. The bigger trick is to rewrite the HTML page to load all the resources from your disk instead of the remote site. – the Tin Man May 09 '13 at 21:29
  • It would be a mess to rewrite URLs for local use. I think the best option is wget, but it doesn't put the content in one single directory. – don ali May 09 '13 at 21:35
  • 1
    http://entrenchant.blogspot.com/2012/02/web-page-mirroring-wget-in-ruby.html has code for this purpose. – the Tin Man May 10 '13 at 05:12
  • Hi Tin Man, that was the perfect solution and did the job. How should I mark it as the answer? – don ali May 10 '13 at 06:54

2 Answers


You can use wget to do this and run it from within your Ruby script.

Here's an example that will rip the homepage of my site, skrimp.ly, and put the contents into a single directory named "download". Everything will be at the top level, and the links embedded in the HTML will be rewritten to be local:

wget -E -H -k -K -p -nH -nd -Pdownload -e robots=off http://skrimp.ly
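
If you'd rather drive it from Ruby, here is a minimal sketch of shelling out to that same command (it assumes wget is installed and on your PATH; the URL and output directory are just the ones from the example above):

    # Run wget from Ruby; passing arguments separately avoids shell quoting issues.
    url = 'http://skrimp.ly'   # page to rip (placeholder)
    ok  = system('wget', '-E', '-H', '-k', '-K', '-p', '-nH', '-nd',
                 '-Pdownload', '-e', 'robots=off', url)
    abort 'wget failed' unless ok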

Note: you should check out some of the docs for wget. It can do some really crazy stuff, like going down multiple levels. If you do that sort of thing, please be cautious -- it can be pretty heavy on a web server and, in some cases, cost the webmaster a lot of $$$$.

http://www.gnu.org/software/wget/manual/html_node/Advanced-Usage.html#Advanced-Usage

  • I've tried this command. When I disconnect from the Internet and open the downloaded .shtml, the images are reloaded, so I did this instead: wget -p --convert-links -nH -nd -Pdownloads http://www.bbc.co.uk/persian/world/2013/05/130509_an_buddhist_monks_attack_muslims.shtml It creates a "downloads" directory with three files: 1) the HTML file, 2) bump?emp=worldwide, 3) robots.txt – don ali May 10 '13 at 03:57
  • Great! If you end up using it, please select my answer. Thanks! – Mario Zigliotto May 10 '13 at 17:43

The link given by the Tin Man in the comments did the job. It shows how to use Nokogiri to download a single page with its pictures for offline reading, with a very clean directory structure.
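
For anyone landing here later, a minimal sketch of that approach using Nokogiri and open-uri (the URL, output directory, and file naming below are placeholders, not the blog post's exact code): fetch the page, save each image next to it, and rewrite the img src attributes to point at the local copies.

    require 'nokogiri'
    require 'open-uri'
    require 'fileutils'
    require 'uri'

    url = 'http://example.com/article.html'   # placeholder: the page to save
    dir = 'download'                          # placeholder: output directory
    FileUtils.mkdir_p(dir)

    page = Nokogiri::HTML(URI.open(url))      # on Ruby < 2.5 use open(url) from open-uri

    page.css('img').each do |img|
      next unless img['src']
      image_url = URI.join(url, img['src']).to_s      # resolve relative srcs against the page URL
      filename  = File.basename(URI.parse(image_url).path)
      next if filename.empty?

      File.open(File.join(dir, filename), 'wb') do |f|
        f.write(URI.open(image_url).read)              # fetch and save the image
      end
      img['src'] = filename                            # rewrite the page to use the local copy
    end

    File.write(File.join(dir, 'index.html'), page.to_html)

This only handles img tags; CSS, scripts, and pages whose image URLs collide on the same basename would need more work, which is what the linked blog post and the wget flags above take care of.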
