
Hi, can someone help me set up a shell script that does the following?

  1. wget http://site.com/xap/wp7?p=1
  2. Parse the HTML and extract every ProductName found in title="Free Shipping ProductName">. Ex: from title="Free Shipping HD7-Case001"> the value HD7-Case001 is extracted.
  3. Output the names to products.txt
  4. Repeat from step 1 for every page, where the "1" in http://site.com/xap/wp7?p=1 is the page number, going up to 50. Ex: http://..wp7?p=1, http://..wp7?p=2, http://..wp7?p=3

I've done some research on my own and have this much code written so far. It definitely needs a lot more work:

#! /bin/sh
... 

while read page; do
wget -q -O- "http://site.com/xap/wp7?p=$page" | 
sed ...

done < "products.txt"
acctman

2 Answers


You can combine wget with PHP for the XML parsing.

The wget bash script:

#!/bin/bash

for page in {1..50}
do
  # Save each page to a temp file, then let xml.php print its product titles.
  wget -q -O "/tmp/$page.xml" "http://site.com/xap/wp7?p=$page"
  php -q xml.php "$page" >> products.txt
done

xml.php

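<?php
// The page number is passed as the first command-line argument.
$file = '/tmp/'.$argv[1].'.xml';
// Assuming the following format:
// <Products><Product title="Free Shipping ProductName"/></Products>

$xml = simplexml_load_file($file);
// Print every product title on its own line, not only the first one.
foreach ($xml->Product as $product) {
    echo $product->attributes()->title, PHP_EOL;
}
/* Adjust the node and attribute names so they match the actual document. */
?>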

This is not a great solution, but PHP's SimpleXML provides a simple way to parse XML.
Hope this gives you a starting point.

ajreal
#!/bin/bash

for page in {1..50}
do
  wget -q "http://site.com/xap/wp7?p=$page" -O - \
    | tr '"' '\n' | grep "^Free Shipping " | cut -d ' ' -f 3
done > products.txt  # redirect the whole loop once, so no page overwrites the previous ones

The tr is turning each double-quote into a newline, so the output of tr will be something like:

<html>
...
... <tag title=
Free Shipping [Product]
> ...

Basically, it's a way to put each Product on its own line.

Next, the grep discards every line except the ones that start with Free Shipping, so its output should look like:

Free Shipping [Product1]
Free Shipping [Product2]
...

Next, the cut is extracting out the third "column" (delimited by spaces), so the output should be:

[Product1]
[Product2]
...
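
To check the three steps one at a time, it can help to run them on a small inline sample instead of the live site. This is only a minimal sketch: the sample markup and the variable names (sample, stage1, stage2, stage3) are made up for illustration, and the real page may be laid out differently.

```shell
#!/bin/bash
# Hypothetical markup standing in for one downloaded page.
sample='<html><a title="Free Shipping HD7-Case001">x</a><a title="Free Shipping HD7-Case002">y</a></html>'

# Step 1: tr turns every double quote into a newline.
stage1=$(printf '%s\n' "$sample" | tr '"' '\n')

# Step 2: grep keeps only the lines that start with "Free Shipping ".
stage2=$(printf '%s\n' "$stage1" | grep '^Free Shipping ')

# Step 3: cut keeps the third space-delimited field.
stage3=$(printf '%s\n' "$stage2" | cut -d ' ' -f 3)

# Print each intermediate result so a broken step is easy to spot.
printf '== after tr ==\n%s\n' "$stage1"
printf '== after grep ==\n%s\n' "$stage2"
printf '== after cut ==\n%s\n' "$stage3"
```

If the "after grep" section comes back empty on the real page, the title text is probably not directly preceded by a double quote, or it does not start with exactly "Free Shipping ".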
Dustin Boswell
  • Nothing is being output. Is the '\n' adding a line break after each output, or is it assuming that each Free Shipping is on its own line? – acctman Jan 28 '11 at 09:58
  • I added further explanation above. Try doing each piece of the command one-by-one to see if it's following the steps above. There is no assumption that the input html has each Free Shipping on its own line, only that the string "Free Shipping [Product]" is truly surrounded by double-quotes. – Dustin Boswell Jan 28 '11 at 19:23
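
A stricter variant matches the whole title="Free Shipping ..."> pattern from the question instead of splitting on quotes. The sketch below assumes a grep that supports -o (GNU and BSD grep both do). The function name extract_products and the sample markup, including the product "Leather Pouch 2", are hypothetical. Unlike cut -d ' ' -f 3, it keeps product names that contain spaces.

```shell
#!/bin/bash
# extract_products: read HTML on stdin, print one product name per line.
extract_products() {
  # grep -o prints each match on its own line; sed strips the fixed
  # prefix and suffix, leaving only the product name.
  grep -o 'title="Free Shipping [^"]*">' \
    | sed -e 's/^title="Free Shipping //' -e 's/">$//'
}

# Hypothetical sample with two products on a single line.
sample='<a title="Free Shipping HD7-Case001"><a title="Free Shipping Leather Pouch 2">'
printf '%s\n' "$sample" | extract_products
```

Inside the page loop, the tr, grep and cut stages could then be replaced by a single pipe into extract_products.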