
I am trying to parse a fairly simple web page for information in a shell script. The page I'm working with is the one generated at http://aruljohn.com/details.php (used in the script below). For example, I would like to pull the information on the internet service provider into a shell variable. It may make sense to use one of the programs xmllint, XMLStarlet or xpath for this purpose. I am quite familiar with shell scripting, but I am new to XPath syntax and to the utilities that implement it, so I would appreciate a few pointers in the right direction.

Here's the beginnings of the shell script:

HTMLISPInformation="$(curl --user-agent "Mozilla/5.0" http://aruljohn.com/details.php)"
# ISP="$(<XPath magic goes here.>)"

For your convenience, here is a utility for dynamically testing XPath syntax online:

http://www.bit-101.com/xpath/

d3pd

5 Answers


Quick and dirty solution...

xmllint --html --xpath "//table/tbody/tr[6]/td[2]" page.html

You can find the XPath of your node using Chrome's Developer Tools: when inspecting the node, right-click it and select Copy XPath.
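
To get the value into a shell variable, as the question asks, the same idea can be fed straight from curl. This is only a sketch: the row/column indices are taken from the command above, the tbody step is dropped because the raw HTML may not contain it (see the comments below), and xmllint's HTML parser warnings are silenced.

ISP="$(curl -s --user-agent "Mozilla/5.0" http://aruljohn.com/details.php \
    | xmllint --html --xpath '//table/tr[6]/td[2]/text()' - 2>/dev/null)"
echo "$ISP"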

I wouldn't use this too much; it is not very reliable.

All the information on your page can be found elsewhere: run whois on your own IP for instance...

Michel Guillet
  • I get "XPath set is empty", even [piping through `sed -e 's/xmlns=".*"//g'`](https://stackoverflow.com/questions/8264134/xmllint-failing-to-properly-query-with-xpath) – Pablo Bianchi Jan 01 '20 at 22:13
  • I had some success with specifying the full path from the root, /html/body/table/tr/td ..., and also removing the tbody element. – Justin Aug 19 '21 at 21:54

You could use my Xidel. Extracting values from HTML pages on the command line is its main purpose. Although it is not a standard tool, it is a single, dependency-free binary that can be installed and run without root access.

It can directly read the value from the webpage without involving other programs.

With XPath:

 xidel http://aruljohn.com/details.php -e '//td[text()="Internet Provider"]/following-sibling::td'

Or with pattern-matching:

 xidel http://aruljohn.com/details.php -e '<td>Internet Provider</td><td>{.}</td>' --hide-variable-names
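
To land the result in a shell variable, something along these lines should work (a sketch; the --silent option, which suppresses Xidel's status messages, is an assumption to verify against xidel --help):

ISP="$(xidel --silent http://aruljohn.com/details.php \
    -e '//td[text()="Internet Provider"]/following-sibling::td')"
echo "$ISP"
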
BeniBela

Consider using PhantomJS. It is a headless WebKit browser that allows you to execute JavaScript/CoffeeScript on a web page. I think it could help you solve your issue.
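
As a rough sketch of what that could look like for the ISP field, assuming phantomjs is on your PATH (the DOM lookup below is illustrative, not taken from the actual page):

# Write a small PhantomJS script, then capture its output in a shell variable.
cat > get-isp.js <<'EOF'
// Load the page and return the cell following the "Internet Provider" label.
var page = require('webpage').create();
page.open('http://aruljohn.com/details.php', function (status) {
    if (status !== 'success') { phantom.exit(1); }
    var isp = page.evaluate(function () {
        var cells = document.querySelectorAll('td');
        for (var i = 0; i < cells.length; i++) {
            if (cells[i].textContent.trim() === 'Internet Provider') {
                return cells[i].nextElementSibling.textContent.trim();
            }
        }
        return '';
    });
    console.log(isp);
    phantom.exit();
});
EOF
ISP="$(phantomjs get-isp.js)"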

pjscrape is a useful web scraping tool built on PhantomJS.

asgoth
  • Thank you. I will take a look at it for my personal use. However, the task I hope to accomplish is to be done on a server on which I am not granted root access, which is why I mentioned standard tools such as xmllint. – d3pd Dec 26 '12 at 20:53
  • Do you need root access? You could just copy it in your user folder and run it from there. – asgoth Dec 26 '12 at 21:12

xpup

XML

xpup is a command-line XML parsing tool written in Go. For example:

$ curl -sL https://www.w3schools.com/xml/note.xml | xpup '/*/body'
Don't forget me this weekend!

or:

$ xpup '/note/from' < <(curl -sL https://www.w3schools.com/xml/note.xml)
Jani

HTML

Here is an example of parsing an HTML page:

$ xpup '/*/head/title' < <(curl -sL https://example.com/)
Example Domain

pup

For HTML parsing, try pup. For example:

$ pup 'title text{}' -f <(curl -sL https://example.com/)
Example Domain
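
pup reads standard input as well, so the result can go straight into a shell variable, in the spirit of the original question:

TITLE="$(curl -sL https://example.com/ | pup 'title text{}')"
echo "$TITLE"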

See related Feature Request for XPath.

Installation

Install pup with: go get github.com/ericchiang/pup.

Kleber Noel
kenorb

HTML-XML-utils

There are many command-line tools in the HTML-XML-utils package that can parse HTML files (e.g. hxselect to match a CSS selector).

There is also xpath, a command-line wrapper around Perl's XPath library (XML::XPath).
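
A quick sketch of hxselect: real-world HTML usually needs to be tidied with hxnormalize -x first, because hxselect expects well-formed input.

TITLE="$(curl -sL https://example.com/ | hxnormalize -x 2>/dev/null | hxselect -c 'title')"
echo "$TITLE"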

Related: Command line tool to query HTML elements at SU

kenorb