
I'm new to bash scripting, and the previous answers didn't help me.

I am trying to harvest IDs from web pages: I need to parse page 1, get a list of IDs, and use them to parse the corresponding web pages.

The thing is I'm not sure how to write the script...

Here's what I would like to do:

  1. Parse url1 according to regexp. Output: list of extracted ids (101, 102, 103, etc).
  2. Parse each url with output id, for example: parse (http://someurl/101), then parse (http://someurl/102), etc.

So far, I have come up with this command:

curl "http://subtitle.co.il/browsesubtitles.php?cs=movies" | grep -o -P '(?<=list.php\?mid=)\d+'

The command above works, and gives a list of ids.

Any advice for the next steps? Am I on the right track?

Thanks!

buntuser
3 Answers


This is a recursive algorithm, so you need to write a function:

parse_url() {
  # Extract all IDs from the page at the given URL (-s silences curl's progress meter)
  ids=$(curl -s "$1" | grep -o -P '(?<=list.php\?mid=)\d+')
  for id in $ids
  do echo "$id"
     # Recurse into the page for each ID
     parse_url "http://someurl/$id"
  done
}

Call this function with the start page, and it will echo all the IDs found on that page, then recursively parse all the http://someurl/ID pages.
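For example, a minimal invocation might look like this (using the start page from the question; the http://someurl/$id inside the function is just a placeholder for the real per-ID URL):

parse_url "http://subtitle.co.il/browsesubtitles.php?cs=movies"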

This just echoes all the IDs found in all the pages. If there's something else you want to do with the IDs, you can pipe this script to that. Also, I don't do any duplicate suppression, so this script could loop forever if there are back-references between pages. You can keep track of the IDs that have already been seen in an array and check it before recursing.
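A minimal sketch of that duplicate check, assuming bash 4+ for associative arrays (the seen array name is just for illustration):

declare -A seen   # IDs that have already been processed (requires bash 4+)

parse_url() {
  local id
  for id in $(curl -s "$1" | grep -o -P '(?<=list.php\?mid=)\d+')
  do
     # Skip IDs we have already visited, to avoid looping forever
     [ -n "${seen[$id]}" ] && continue
     seen[$id]=1
     echo "$id"
     parse_url "http://someurl/$id"
  done
}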

Barmar

Your next step would probably be to loop over all the IDs:

parse_url () {
    for id in $(grep -o -P '(?<=list.php\?mid=)\d+' "$1"); do
        # Use $id to build the URL directly
        url="http://someurl/$id"
        # or grep the full URL containing the ID out of the page
        # (double quotes are needed here so that $id is expanded)
        url="$(grep -o -P "http://[a-zA-Z./%0-9:&;=?]*list.php\?mid=$id[a-zA-Z./%0-9:&;=?]*" "$1")"
        # Download the page into a temporary file
        new_page_file="$(mktemp)"
        wget -q -O "$new_page_file" "$url"
        # Parse the downloaded page recursively
        parse_url "$new_page_file"
        # Delete the temporary file
        rm "$new_page_file"
    done
}

wget -q -O file.html "http://subtitle.co.il/browsesubtitles.php?cs=movies"
parse_url file.html

Here we have defined a function called parse_url that iterates over all the IDs it finds in the file passed as an argument (i.e. $1 is the first argument passed to the function).

We can then use the ID to build the URL ourselves, or grep the full URL containing that ID out of the same file. Note that the regular expression for finding the URL assumes that the URL has a specific format (see the example after this list):

  1. It starts with "http://"
  2. It only contains the characters that are used between the square brackets
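
For example, given a hypothetical line like the one below in the downloaded page, the pattern (with the variable double-quoted so that $id is expanded) pulls out the full link:

# Hypothetical input line from the downloaded HTML:
#   <a href="http://subtitle.co.il/list.php?mid=101&lang=he">...</a>
id=101
echo '<a href="http://subtitle.co.il/list.php?mid=101&lang=he">...</a>' |
    grep -o -P "http://[a-zA-Z./%0-9:&;=?]*list.php\?mid=$id[a-zA-Z./%0-9:&;=?]*"
# Prints: http://subtitle.co.il/list.php?mid=101&lang=he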

To download the page, we create a temporary file with the mktemp command. Since you said you're new to bash scripting, I'll give a quick explanation of the $(...) constructs that appear. They run the command (or series of commands) between the parentheses, capture its standard output, and substitute it where the $(...) was. In this case, the result is placed inside the double quotes that we assign to the new_page_file variable, so $new_page_file contains the name of a randomly named file created for storing the downloaded page.
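
As a tiny standalone illustration of command substitution (not specific to this script):

# $(...) runs the command and substitutes its standard output in place
today="$(date +%F)"               # e.g. 2012-10-05
echo "Report generated on $today" # prints: Report generated on 2012-10-05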

We can then download the URL into that temporary file, call the function to parse it, and then delete it.

To call the function initially, we download the initial URL into a file file.html, and then call the function passing that file name as the argument.

EDIT: Added recursion, based on Barmar's answer

Hope this helps a little =)

  • Hi, Janito, thanks a lot for your reply; both your reply and Barmar's really helped me understand bash functions better. So currently I have the following script: `parse_url() { ids=$(curl "$1" | grep -o -P '(?<=list.php\?mid=)\d+') for id in $ids do echo url1 extracted id:$id echo "http://subtitle.co.il/view.php?id=$id&m=subtitles#${id#1}" done } parse_url http://subtitle.co.il/browsesubtitles.php?cs=movies` – buntuser Oct 05 '12 at 13:31
  • Great! Glad I could help! If you need anything else, don't hesitate to ask =) – Janito Vaqueiro Ferreira Filho Oct 05 '12 at 13:40
  • So, now that I've got the IDs, I'm trying to catch the movie titles. I came up with the regex to do that and tested it with RegExr, but when I run the curl command to test whether I get the titles, I get the whole page, as if the grep were ignored: `curl http://subtitle.co.il/view.php?id=1123506&m=subtitles#123506 | grep -o -P '(?<=style="direction:ltr;" title=")(.*?)(?=">)'` Also, if I put this in a bash script, how do I escape the # sign in the URL? Thanks – buntuser Oct 06 '12 at 01:15
  • Hi! You should enclose it between double quotes: `curl "subtitle.co.il/view.php?id=1123506&m=subtitles#123506" | ...`. – Janito Vaqueiro Ferreira Filho Oct 06 '12 at 01:35
  • Great! Thanks! How do I get rid of all the transfer statistics that curl prints on screen? For example: `% Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 2 59181 2 1271 0 0 1735 0 0:00:34 --:--:-- 0:00:34` – buntuser Oct 06 '12 at 02:19

This kind of CLI HTML parsing is exactly what I wrote my Xidel for. (And it uses XPath instead of regexps, so you don't summon Cthulhu... (too late, he is already there; I just went to my bathroom and there was this really strange sound...))

If you just need the ids to "use them to parse corresponding web pages", you can just follow the links instead of explicitly extracting the ids.

E.g. print the titles of all linked pages:

xidel 'http://subtitle.co.il/browsesubtitles.php?cs=movies' -f '//a[starts-with(@href,"list.php")]' -e //title

This follows all links //a whose destination satisfies starts-with(@href,"list.php"). (-f means follow links, -e means extract data.)

Or, if you want to extract the large text block on the view URL (I don't understand the language, so no idea what it is saying):

xidel 'http://subtitle.co.il/browsesubtitles.php?cs=movies' -f '//a[starts-with(@href,"list.php")]/replace(@href, "list.php[?]mid=", "view.php?id=")' -e 'css("#profilememo")'

Or if you really need the ids separately, you can extract them first:

xidel 'http://subtitle.co.il/browsesubtitles.php?cs=movies' -e '//a[starts-with(@href,"list.php")]/substring-after(@href,"mid=")' -f '//a[starts-with(@href,"list.php")]' -e //title

Or, more simply, with a temporary variable links to store all the links:

xidel 'http://subtitle.co.il/browsesubtitles.php?cs=movies' -e '(links:=//a[starts-with(@href,"list.php")])[0]' -e '$links/substring-after(@href,"mid=")' -f '$links' -e //title
BeniBela
  • Hi BeniBela, thank you for your help. I decided to try and use your tool and keep the code gods satisfied :) I downloaded xidel-0.5.src.tar.gz, but how do I install and run it? Thanks – buntuser Oct 06 '12 at 00:24
  • First, you need FPC and [Lazarus](http://lazarus.freepascal.org/) to compile it (and on Linux also OpenSSL and [Synapse](http://synapse.ararat.cz/doku.php/download)). Then open `programs/internet/xidel/xidel.lpi` in Lazarus and click "Run". (Perhaps you need to compile my internet tools library first, by opening `components/pascal/internettools.lpk` in Lazarus and clicking "Use >> Install".) – BeniBela Oct 06 '12 at 00:53
  • (That might become confusing if you are not used to Pascal programming. Have you tried the binaries? The Windows version runs on Windows and in Wine, although with Windows you need to swap the ' single quotes and " double quotes.) – BeniBela Oct 06 '12 at 00:54