
I do have permission to do this.

I've got a website with about 250 pages from which I need to download the 'product descriptions' and 'product images'. How do I do it? I'd like to get the data out into a CSV, so that I can put it in a DB table. Could someone point me to a good tutorial to get started on this? I should be using cURL, right?

So far, I adapted this from another Stack Overflow page, How do I transfer wget output to a file or DB?:

curl somesite.com | grep 'somepattern' | sed -e "s/^\(.*\)$/INSERT INTO tableName (columnName) VALUES ('\1');/" | psql dbname
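
Since the end goal is a CSV that then goes into a database table, it may be simpler to write the CSV first and bulk-load it afterwards instead of generating INSERT statements on the fly. A minimal sketch, assuming Postgres and a products table whose columns match the CSV (the table name, column names, and file name are all placeholders):

# Bulk-load a scraped CSV into an (assumed) products table in one step.
psql dbname -c "\copy products(id, description, image_url) FROM 'products.csv' WITH CSV"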

And I created this, which sucks, to get the images:

#!/bin/bash
# Pull the page source, split on double quotes, keep the 8th field (site-specific),
# filter for .jpg paths, and download each image from the same host.

lynx --source "www.site.com" | cut -d'"' -f8 | grep jpg | while read -r image
do
    wget "www.site.com/$image"
done

I put that together by watching this video: http://www.youtube.com/watch?v=dMXzoHTTvi0.
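
For the images alone, if they are ordinary <img> links on the product pages, wget's recursive mode can often collect them without any parsing at all. This is only a sketch; the start URL, the recursion depth, and the assumption that everything worth keeping ends in .jpg/.jpeg are placeholders:

# Recursively fetch only .jpg/.jpeg files reachable from the product listing,
# flattening them into a local images/ directory.
wget -r -l 2 -nd -A 'jpg,jpeg' -P images/ "http://www.site.com/products/"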

  • If you have permission, wouldn't you have the files locally (i.e., not need to access them as web sites with curl)? – Fosco Jan 14 '11 at 18:59
  • If you want cumbersome code, then yes, the fiddly curl API is indeed preferable to PHP's HttpRequest, PEAR Http_Request or Zend_Http. If it's a one-time download thing, a simple `wget -p http://example.org/products/*` might be easier. – mario Jan 14 '11 at 19:01
  • Perl's `WWW::Mechanize` comes to mind. Probably a better tool for the job than PHP (mainly because CPAN is awesome) – derobert Jan 14 '11 at 19:04
  • @Fosco: No. @Mario: Is it possible to go by the DIV or something using wget? – Wolfpack'08 Jan 14 '11 at 19:20
  • @Fosco: If the data are publicly available you are allowed to do that. – nico Jan 14 '11 at 19:22
  • Nope. It only downloads files. You need to postprocess it with phpQuery or QueryPath or [another HTML parser](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html), which simplifies this a great deal. – mario Jan 14 '11 at 19:25

1 Answer


You want to do what's called screen scraping.

Here are some links to get you started:
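
To make it concrete, here is a minimal command-line sketch of the whole job. The URL scheme, the 1–250 page range, and the markup the sed patterns match (a single-line <div class="description"> and an <img src="...jpg">) are assumptions about your site and will need to be adapted; a proper HTML parser, as suggested in the comments, will be far more robust than regular expressions:

#!/bin/bash
# For each product page: extract a description and an image URL,
# append a CSV row, and download the image.
for i in $(seq 1 250); do
    page=$(curl -s "http://www.site.com/product/$i")
    desc=$(echo "$page" | sed -n 's/.*<div class="description">\(.*\)<\/div>.*/\1/p')
    img=$(echo "$page" | sed -n 's/.*<img src="\([^"]*\.jpg\)".*/\1/p')
    desc=${desc//\"/\"\"}    # double embedded quotes so the CSV field stays valid
    printf '%s,"%s","%s"\n' "$i" "$desc" "$img" >> products.csv
    [ -n "$img" ] && wget -q -P images/ "http://www.site.com/$img"
done

The resulting products.csv can then be bulk-loaded into the database, for example with psql's \copy.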

  • I'd like to do it from the command line, and I was under the impression that 'screen scraping' was visual. I'll give these links a look and get back to you, though. Thank you, Byron. – Wolfpack'08 Jan 14 '11 at 19:17
  • Maybe it's called 'recursive fetching'? – Wolfpack'08 Jan 14 '11 at 19:24
  • I've reviewed the links and found that many of them lead to code that returns errors; the first link's first code block returns invalid-token errors, for example. I hope you can put together a good example. :) I found one myself and would like to link it in my own answer, but I welcome you to come back with yours first. Thank you. – Wolfpack'08 Jan 15 '11 at 23:17