html parsing with grep and regex

Question

I'm writing a shell script that takes the name of a mountain (only those over 8000 m) as a parameter and returns the name or names of the first people to climb it. I found a page with the information, which I can download with curl, but I don't really know my way around regex. Given HTML like the sample below and the mountain's name, how can I extract the climbers? Thanks in advance.

site: http://www.valandre.com/blog/2011/06/21/the-14-peaks-over-8000-meters/

html sample

    <p class="wp-caption-text">Everest</p></div></div></div><p><strong>Other names: </strong>Sagamartha, Chomolangma or Qomolangma<br
/> <strong>Altitude:</strong> 8848 m<br
/> <strong>Location: </strong>Tibet / Nepal<br
/> <strong>First ascent:</strong> May 29, 1953 by Sir Edmund Hillary and Tenzing Norgay<br
/> <strong>Expedition</strong><strong>: </strong>New Zeeland/India</p><blockquote><p>&nbsp;</p><p><strong>Difficulty</strong> : <em>Mostly a non-technical climb regardless on which of the two normal routes you choose. On the south you have to deal with a dangerous ice fall and The Hillary Step, a short section of rock, on the north side there are some short technical passages. On both routes (permanent) fixed ropes are placed at the tricky sections. The altitude is main obstacle. Nowadays also crowding is mentioned as a factor of difficulty</em>.</p>

I found another site that may be easier to parse: http://www.alpineascents.com/8000m-peaks.asp

html sample

<tr>
         <td><strong>Everest</strong></td>
         <td>8,850m <br /></td>
         <td>29,035ft</td>
         <td><div align="center">Nepal/Tibet </div></td>
         <td>1953; Sir E. Hillary, T. Norgay</td>
       </tr>

String-based scraping of a website is a very unstable approach. Scraping is never a replacement for an API, but you will get better (more stable) results when you parse the DOM tree instead of parsing the HTML yourself. Take a look at the DOM tree parsers for PHP; they offer easy access to single elements and their attributes. — arkascha, Mar 03 '14 at 14:53
[Can't parse html with regex](http://stackoverflow.com/a/1732454/7552) — glenn jackman, Mar 03 '14 at 14:58
@La-comadreja thanks for the tip, but the problem is that this is an assignment and the requirement is that the script fits in an SMS (max 160 characters), and I don't think that's possible with Java :P That's why I started with shell, because it's the shortest way :) — spd92, Mar 03 '14 at 15:00
@glennjackman maybe I'm not completely familiar with regex, but 'curl -s championshiphistory.com/nba.php|grep -P "$1\t[^\t]+\t"|cut -f2 ;fi' looks like it has regex in it, and it does the same thing as mine, only with NBA winners — spd92, Mar 03 '14 at 15:02
@spd92 if you save the page as a file it can be done with java. — La-comadreja, Mar 03 '14 at 15:13
@La-comadreja in 160 characters? I have to save the HTML file and all the files that are needed inside the program. I can't make my own files; I can only use libraries and so on ... — spd92, Mar 03 '14 at 15:17
It can be done, it's not too hard. But it's like building a house on sand: very fragile and when the source page changes, your script breaks. This data set can't be very big. Download it once and hardcode the data in your script. — glenn jackman, Mar 03 '14 at 15:18
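
The hardcoding suggestion in the last comment could look roughly like the sketch below. The function name `first_ascent` is hypothetical, only a handful of peaks are filled in, and apart from Everest the pairing of peak to climbers comes from general knowledge rather than from the HTML samples in the question.

```shell
#!/bin/sh
# first_ascent NAME: look the climbers up in a table embedded in the script.
# Only a few peaks are listed; the remaining ones would be added the same way.
first_ascent() {
    # Lower-case the argument so "Everest" and "everest" both match.
    case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
        everest)        echo "Sir Edmund Hillary and Tenzing Norgay" ;;
        k2)             echo "Achille Compagnoni and Lino Lacedelli" ;;
        kangchenjunga)  echo "George Band and Joe Brown" ;;
        "nanga parbat") echo "Hermann Buhl" ;;
        annapurna)      echo "Maurice Herzog and Louis Lachenal" ;;
        *)              echo "unknown mountain: $1" >&2; return 1 ;;
    esac
}
```

Calling `first_ascent "$1"` from the script body then needs no network access and cannot break when the page layout changes, at the cost of carrying the data in the script.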

Khaelex · Accepted Answer · 2014-03-03T17:05:02.400

2

Using the first HTML sample:

grep '<strong>First ascent:</strong>' | sed 's/.*by \([^>]*\)<.*/\1/'

Output:

Sir Edmund Hillary and Tenzing Norgay
Achille Compagnoni and Lino Lacedelli
George Band and Joe Brown
Kurt Diemberger, Peter Diener, Nawang Dorje, Nima Dorje, Ernst Forrer and Albin Schelbert
Hermann Buhl
Maurice Herzog and Louis Lachenal
Andrew Kauffman and Peter Schoening
Hermann Buhl, Kurt Diemberger, Marcus Schmuck and Fritz Wintersteller

It finds all lines with the 'First ascent' label and grabs everything between by and the <br /> tag.
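
To see what the sed expression captures, it can be run on a single line taken from the first HTML sample. This is only a small sketch; the function name `extract_climbers` is made up for illustration.

```shell
#!/bin/sh
# extract_climbers: read lines on stdin and keep the text that follows
# "by " up to the next "<" on each line.
extract_climbers() {
    sed 's/.*by \([^>]*\)<.*/\1/'
}

# One line from the first HTML sample in the question.
printf '%s\n' '/> <strong>First ascent:</strong> May 29, 1953 by Sir Edmund Hillary and Tenzing Norgay<br' |
    extract_climbers
# prints: Sir Edmund Hillary and Tenzing Norgay
```

Because the leading `.*` is greedy, the match prefers the last `by ` on the line, so a line containing a second `by ` further along would likely produce a different result.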

Edit:

The original answer doesn't filter by the name of the mountain. In addition, the <strong>First ascent:</strong> is too specific for the page (sometimes there is a space after the :). The following should work.

grep -i "$1" -A3 | grep 'First ascent:' | sed 's/.*by \([^>]*\)<.*/\1/'

Explanation: grep -i "$1" -A3 selects the line with the mountain. -i makes the search case insensitive. The -A3 selects the 3 lines following the matched line, which gets the line with the list of climbers. The quotes around "$1" are for mountains with names that have spaces.
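
Put together, the pipeline could be wrapped in a function along the lines of the sketch below. It assumes the layout of the first HTML sample: a `wp-caption-text` caption line, with the `First ascent:` line within the three lines that follow. The function name `first_ascent` and the idea of anchoring the first grep on the caption markup are illustrative additions, not part of the commands above; matching the caption rather than the bare name is meant to skip other lines that merely mention the mountain.

```shell
#!/bin/sh
# first_ascent NAME: read the page HTML on stdin and print the climbers
# from the "First ascent:" line that follows the mountain's caption.
# Assumes each mountain is introduced by <p class="wp-caption-text">NAME</p>.
first_ascent() {
    grep -i -A3 "wp-caption-text\">$1<" |   # caption line plus the 3 lines after it
        grep 'First ascent:' |              # keep only the first-ascent line
        sed 's/.*by \([^>]*\)<.*/\1/'       # text between "by " and the next "<"
}
```

In a script this would be used as `curl -s http://www.valandre.com/blog/2011/06/21/the-14-peaks-over-8000-meters/ | first_ascent "$1"`. Note that grep treats the name as a regular expression, and that `-A` is a GNU/BSD grep option rather than a POSIX one.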

edited Mar 03 '14 at 17:05

answered Mar 03 '14 at 15:32


Thanks a lot, but I still have a bit of a problem ... I need only the climbers associated with the mountain given in the parameters. Can you help me with that? – spd92 Mar 03 '14 at 15:49
What is the parameter? The name of the mountain? – Khaelex Mar 03 '14 at 15:53
Yes ... and I think some of them are missing. There are 14 mountains in total, so it should be 14 lines, no? I think a few of them aren't showing. – spd92 Mar 03 '14 at 15:59
Mountains 4, 5, 6, 7 and 13 are missing, I think ... Maybe it's easier with the second page ... it's a lot "cleaner" :) – spd92 Mar 03 '14 at 16:02
PS: I resolved the problem with the missing mountains ... in those cases there was a " " between "First ascent:" and – spd92 Mar 03 '14 at 16:10
`grep 'First ascent:' | grep -i $NAME_OF_MOUNTAIN | sed 's/.*by \([^>]*\)<.*/\1/'` The first grep has been changed to fix the missing mountains, the second grep selects the given mountain. – Khaelex Mar 03 '14 at 16:11
Sorry, my version of the web page had no line breaks. Try `grep -i $1 -A3 | grep 'First ascent:' | sed 's/.*by \([^>]*\)<.*/\1/'` The -A3 for grep returns 3 lines after the matched line, which gets the line with the climbers. The second grep filters the other 3 lines out, to get just the line with the climbers. – Khaelex Mar 03 '14 at 16:40
Thanks a lot, man :) For a few (like 2) of the mountains it's not working properly because it prints some other climbers too, but I will fix that somehow :P Thanks again, you helped a lot :D – spd92 Mar 03 '14 at 16:53

BeniBela · Answer 2 · 2014-03-03T16:27:21.033

1

You can use my Xidel which does pattern matching on the html tree:

xidel http://www.alpineascents.com/8000m-peaks.asp -e "<tr><strong>Everest</strong><td/>{3}<td>{.}</td></tr>"

Just 109 characters...

(Replace Everest with $1 if it is inside a script with that as parameter)

Or for the other site:

xidel http://www.valandre.com/blog/2011/06/21/the-14-peaks-over-8000-meters/ -e "<p class=\"wp-caption-text\">Everest</p><strong>First ascent:</strong>{text()}"

edited Mar 03 '14 at 16:27

answered Mar 03 '14 at 16:20


The command isn't recognized on my machine :) But does this return the first ascents of the given mountain? – spd92 Mar 03 '14 at 16:34
You need to download it from the link; Yes, if you replace Everest by $1 in the line – BeniBela Mar 03 '14 at 16:53

score 0 · Answer 3 · answered Mar 03 '14 at 15:31

Firstly, go with the first page in your question. Here's a Java scraper for the "curl" downloaded file:

import java.util.Scanner;
import java.io.*;

public class PageInfo {
    // Writes each mountain name, followed by its first climbers, to climbers.txt.
    public static void main(String[] args) throws FileNotFoundException {
        Scanner scan = new Scanner(new File(args[0]));  // file you downloaded
        PrintWriter output = new PrintWriter("climbers.txt");
        while (scan.hasNextLine()) {
            String s = scan.nextLine();
            if (s.contains("wp-caption-text\">")) {
                // mountain name: text between the caption tag and </p>
                s = s.split("wp-caption-text\">")[1];
                if (s.length() > 1) output.println(s.split("</p>")[0]);
            } else if (s.contains("First ascent:") && s.contains("by ")) {
                // climbers: text between "by " and the <br tag
                s = s.split("by ")[1];
                output.println(s.split("<br")[0]);
            }
        }
        scan.close();
        output.close();
    }
}

This is not even close to 160 characters :)) I told you I have a 160-character limit; that's why I use shell. — spd92, Mar 03 '14 at 15:33
