help with linux shell script using wget and sed

Question

Hi, can someone help me set up a shell script that does the following?

1. wget http://site.com/xap/wp7?p=1
2. Read the HTML and extract every ProductName found in title="Free Shipping ProductName">. For example, from title="Free Shipping HD7-Case001"> it should extract HD7-Case001.
3. Write the output to products.txt.
4. Repeat from step 1 with the page number in the URL going from 1 up to 50, e.g. http://..wp7?p=1, http://..wp7?p=2, http://..wp7?p=3

I've done some research on my own and have written this much so far. It definitely needs a lot more work:

#! /bin/sh
... 

while read page; do
wget -q -O- "http://site.com/xap/wp7?p=$page" | 
sed ...

done < "products.txt"

Is there a particular reason you need to do this with wget and sed? — Jason LeBrun, Jan 28 '11 at 07:55
[This way madness lies](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). Follow the link that Ignacio gave. — Dennis Williamson, Jan 28 '11 at 08:10
@Jason No reason. I've been googling to figure out how to do it on my own, and that's what I've come up with so far. — acctman, Jan 28 '11 at 08:23

score 1 · Answer 1 · answered Jan 28 '11 at 08:41

You can combine wget with PHP for the XML parsing.

The wget bash script:

#!/bin/bash

for page in {1..50}
do
  wget -q -O /tmp/$page.xml "http://site.com/xap/wp7?p=$page"
  php -q xml.php $page >> products.txt
done

xml.php

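<?php
$file = '/tmp/' . $argv[1] . '.xml';
// Assuming the following format:
// <Products><Product title="Free Shipping ProductName"/></Products>

$xml = simplexml_load_file($file);
if ($xml === false) {
    exit(1); // the page could not be parsed as XML
}
// Print every product title, one per line, without the "Free Shipping " prefix.
foreach ($xml->Product as $product) {
    echo str_replace('Free Shipping ', '', (string) $product['title']), "\n";
}
/* Adjust the node path and the replacement to match the real document. */
?>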

It's not a great idea, but PHP's SimpleXML provides a simple way to parse XML.
Hope this gives you a starting point.

Dustin Boswell · Accepted Answer · 2011-01-28T19:21:07.143

1

#!/bin/bash

for page in {1..50}
do
  wget -q "http://site.com/xap/wp7?p=$page" -O - \
    | tr '"' '\n' | grep "^Free Shipping " | cut -d ' ' -f 3
# Redirect once, after the loop, so each page does not overwrite the previous one.
done > products.txt

The tr is turning each double-quote into a newline, so the output of tr will be something like:

<html>
...
... <tag title=
Free Shipping [Product]
> ...

Basically, it's a way to put each Product on its own line.

Next, grep discards every line except those that start with Free Shipping, so its output should look like:

Free Shipping [Product1]
Free Shipping [Product2]
...

Finally, cut extracts the third space-delimited "column", so the output should be:

[Product1]
[Product2]
...
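
Since the question asked about sed, here is a minimal sketch of the same extraction using grep -o and sed instead of tr and cut. It assumes every product appears as title="Free Shipping <name>" with double quotes, and that your grep supports -o (GNU and BSD grep do). The function name extract_products and the sample line are made up for illustration. Unlike cut -d ' ' -f 3, which keeps only the third space-separated word, this keeps the whole name even if it contains spaces.

```shell
#!/bin/bash
# Hypothetical helper: reads HTML on stdin, prints one product name per line.
# Assumes every product appears as title="Free Shipping <name>".
extract_products() {
  # grep -o prints each matching attribute on its own line;
  # the two sed substitutions strip the fixed prefix and the closing quote.
  grep -o 'title="Free Shipping [^"]*"' \
    | sed -e 's/^title="Free Shipping //' -e 's/"$//'
}

# Made-up sample: two products on one line, one name containing spaces.
sample='<a title="Free Shipping HD7-Case001">x</a><a title="Free Shipping HD7 Case 002">y</a>'
printf '%s\n' "$sample" | extract_products
```

To cover all 50 pages, the function could replace the tr/grep/cut stages in the loop above: `for page in {1..50}; do wget -q -O - "http://site.com/xap/wp7?p=$page" | extract_products; done > products.txt`.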

edited Jan 28 '11 at 19:21

answered Jan 28 '11 at 08:59

Dustin Boswell


Nothing is being output. Does the '\n' add a line break after each output, or does it assume that each Free Shipping is on its own line? – acctman Jan 28 '11 at 09:58
I added further explanation above. Try doing each piece of the command one-by-one to see if it's following the steps above. There is no assumption that the input html has each Free Shipping on its own line, only that the string "Free Shipping [Product]" is truly surrounded by double-quotes. – Dustin Boswell Jan 28 '11 at 19:23
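
The one-by-one check suggested in the comment above could be sketched like this. It runs each stage of the pipeline on an already-saved page and prints what that stage produces, so the first empty stage is easy to spot. The function name show_stages and the sample page are assumptions for illustration, not part of the original answer.

```shell
#!/bin/bash
# Hypothetical helper: print what each stage of the pipeline produces
# for an already-downloaded page.
show_stages() {
  local file=$1
  echo "== after tr (first 5 lines) =="
  tr '"' '\n' < "$file" | head -n 5
  echo "== after grep =="
  tr '"' '\n' < "$file" | grep '^Free Shipping '
  echo "== after cut =="
  tr '"' '\n' < "$file" | grep '^Free Shipping ' | cut -d ' ' -f 3
}

# Made-up sample page; a real one could be saved first with
#   wget -q -O page1.html "http://site.com/xap/wp7?p=1"
sample_file=$(mktemp)
printf '%s\n' '<a href="/p/1" title="Free Shipping HD7-Case001">case</a>' > "$sample_file"
show_stages "$sample_file"
rm -f "$sample_file"
```

If the "after grep" section comes out empty on a real page, the page most likely does not contain the exact text "Free Shipping directly after a double quote. The attribute might use single quotes, for instance, or the wording or capitalization might differ.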
