
I do have permission to do this.

I've got a website with about 250 pages from which I need to download the 'product descriptions' and 'product images'. How do I do it? I'd like to get the data out into a CSV, so that I can put it in a DB table. Could someone point me to a good tutorial to get started on this? I should be using cURL, right?

So far, I adapted this from another Stack Overflow page, How do I transfer wget output to a file or DB?:

curl somesite.com | grep 'somepattern' | sed -e "s/^\(.*\)$/INSERT INTO tableName (columnName) VALUES ('\1');/" | psql dbname
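
Since the end goal is a CSV that then goes into a database table, it may be simpler to write the CSV first and bulk-load it afterwards instead of generating INSERT statements on the fly. A minimal sketch, assuming Postgres and a products table whose columns match the CSV (the table name, column names, and file name are all placeholders):

# Bulk-load a scraped CSV into an (assumed) products table in one step.
psql dbname -c "\copy products(id, description, image_url) FROM 'products.csv' WITH CSV"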

And I created this, which sucks, to get the images:

#!/bin/bash
# Pull the page source, split on double quotes, keep the 8th field (site-specific),
# filter for .jpg paths, and download each image from the same host.

lynx --source "www.site.com" | cut -d'"' -f8 | grep jpg | while read -r image
do
    wget "www.site.com/$image"
done

I put that together by watching this video: http://www.youtube.com/watch?v=dMXzoHTTvi0.
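
For the images alone, if they are ordinary <img> links on the product pages, wget's recursive mode can often collect them without any parsing at all. This is only a sketch; the start URL, the recursion depth, and the assumption that everything worth keeping ends in .jpg/.jpeg are placeholders:

# Recursively fetch only .jpg/.jpeg files reachable from the product listing,
# flattening them into a local images/ directory.
wget -r -l 2 -nd -A 'jpg,jpeg' -P images/ "http://www.site.com/products/"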

  • If you have permission, wouldn't you have the files locally (i.e., not need to access them as web sites with curl)? – Fosco Jan 14 '11 at 18:59
  • If you want cumbersome code, then yes, the fiddly curl API is indeed preferable to PHP's HttpRequest, PEAR Http_Request or Zend_Http. If it's a one-time download thing, a simple `wget -p http://example.org/products/*` might be easier. – mario Jan 14 '11 at 19:01
  • Perl's `WWW::Mechanize` comes to mind. Probably a better tool for the job than PHP (mainly because CPAN is awesome) – derobert Jan 14 '11 at 19:04
  • @Fosco: No. @Mario: Is it possible to go by the DIV or something using wget? – Wolfpack'08 Jan 14 '11 at 19:20
  • @Fosco: If the data are publicly available you are allowed to do that. – nico Jan 14 '11 at 19:22
  • Nope. It only downloads files. You need to postprocess it with phpQuery or QueryPath or [another HTML parser](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html), which simplifies this a great deal. – mario Jan 14 '11 at 19:25

1 Answer


You want to do what's called screen scraping.

Here are some links to get you started:
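
To make it concrete, here is a minimal command-line sketch of the whole job. The URL scheme, the 1–250 page range, and the markup the sed patterns match (a single-line <div class="description"> and an <img src="...jpg">) are assumptions about your site and will need to be adapted; a proper HTML parser, as suggested in the comments, will be far more robust than regular expressions:

#!/bin/bash
# For each product page: extract a description and an image URL,
# append a CSV row, and download the image.
for i in $(seq 1 250); do
    page=$(curl -s "http://www.site.com/product/$i")
    desc=$(echo "$page" | sed -n 's/.*<div class="description">\(.*\)<\/div>.*/\1/p')
    img=$(echo "$page" | sed -n 's/.*<img src="\([^"]*\.jpg\)".*/\1/p')
    desc=${desc//\"/\"\"}    # double embedded quotes so the CSV field stays valid
    printf '%s,"%s","%s"\n' "$i" "$desc" "$img" >> products.csv
    [ -n "$img" ] && wget -q -P images/ "http://www.site.com/$img"
done

The resulting products.csv can then be bulk-loaded into the database, for example with psql's \copy.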

  • I'd like to do it from the command line, and I was under the impression that 'screen scraping' was visual. I'll give these links a look and get back to you, though. Thank you, Byron. – Wolfpack'08 Jan 14 '11 at 19:17
  • Maybe it's called 'recursive fetching'? – Wolfpack'08 Jan 14 '11 at 19:24
  • I've reviewed the links and found that many of them lead to code that returns errors; the first link's first code block returns invalid-token errors, for example. I hope you can put together a good example. :) I found one myself and would like to link it in my own answer, but I welcome you to come back with yours first. Thank you. – Wolfpack'08 Jan 15 '11 at 23:17