
I am trying to parse a fairly simple web page for information in a shell script. The page I'm working with is the one generated at http://aruljohn.com/details.php (used in the script below). For example, I would like to pull the information on the internet service provider into a shell variable. It may make sense to use one of the programs xmllint, XMLStarlet or xpath for this purpose. I am quite familiar with shell scripting, but I am new to XPath syntax and to the utilities that implement it, so I would appreciate a few pointers in the right direction.

Here's the beginnings of the shell script:

HTMLISPInformation="$(curl --user-agent "Mozilla/5.0" http://aruljohn.com/details.php)"
# ISP="$(<XPath magic goes here.>)"

For your convenience, here is a utility for dynamically testing XPath syntax online:

http://www.bit-101.com/xpath/

d3pd

5 Answers


Quick and dirty solution...

xmllint --html --xpath "//table/tbody/tr[6]/td[2]" page.html

You can find the XPath of your node using Chrome's Developer Tools: when inspecting the node, right-click it and select Copy XPath.
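
To get the value into a shell variable, as the question asks, the same idea can be fed straight from curl. This is only a sketch: the row/column indices are taken from the command above, the tbody step is dropped because the raw HTML may not contain it (see the comments below), and xmllint's HTML parser warnings are silenced.

ISP="$(curl -s --user-agent "Mozilla/5.0" http://aruljohn.com/details.php \
    | xmllint --html --xpath '//table/tr[6]/td[2]/text()' - 2>/dev/null)"
echo "$ISP"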

I wouldn't use this too much; it is not very reliable.

All the information on your page can be found elsewhere: run whois on your own IP for instance...

Michel Guillet
  • I get "XPath set is empty", even [piping through `sed -e 's/xmlns=".*"//g'`](https://stackoverflow.com/questions/8264134/xmllint-failing-to-properly-query-with-xpath) – Pablo Bianchi Jan 01 '20 at 22:13
  • I had some success with specifying the full path from the root, /html/body/table/tr/td ..., and also removing the tbody element. – Justin Aug 19 '21 at 21:54

You could use my Xidel. Extracting values from HTML pages on the command line is its main purpose. Although it is not a standard tool, it is a single, dependency-free binary that can be installed and run without root access.

It can directly read the value from the webpage without involving other programs.

With XPath:

 xidel http://aruljohn.com/details.php -e '//td[text()="Internet Provider"]/following-sibling::td'

Or with pattern-matching:

 xidel http://aruljohn.com/details.php -e '<td>Internet Provider</td><td>{.}</td>' --hide-variable-names
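
To land the result in a shell variable, something along these lines should work (a sketch; the --silent option, which suppresses Xidel's status messages, is an assumption to verify against xidel --help):

ISP="$(xidel --silent http://aruljohn.com/details.php \
    -e '//td[text()="Internet Provider"]/following-sibling::td')"
echo "$ISP"
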
BeniBela

Consider using PhantomJS. It is a headless WebKit browser that allows you to execute JavaScript/CoffeeScript on a web page. I think it could help you solve your issue.
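
As a rough sketch of what that could look like for the ISP field, assuming phantomjs is on your PATH (the DOM lookup below is illustrative, not taken from the actual page):

# Write a small PhantomJS script, then capture its output in a shell variable.
cat > get-isp.js <<'EOF'
// Load the page and return the cell following the "Internet Provider" label.
var page = require('webpage').create();
page.open('http://aruljohn.com/details.php', function (status) {
    if (status !== 'success') { phantom.exit(1); }
    var isp = page.evaluate(function () {
        var cells = document.querySelectorAll('td');
        for (var i = 0; i < cells.length; i++) {
            if (cells[i].textContent.trim() === 'Internet Provider') {
                return cells[i].nextElementSibling.textContent.trim();
            }
        }
        return '';
    });
    console.log(isp);
    phantom.exit();
});
EOF
ISP="$(phantomjs get-isp.js)"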

pjscrape is a useful web scraping tool built on PhantomJS.

asgoth
  • Thank you. I will take a look at it for my personal use. However, the task I hope to accomplish is to be done on a server on which I am not granted root access, which is why I mentioned standard tools such as xmllint. – d3pd Dec 26 '12 at 20:53
  • Do you need root access? You could just copy it in your user folder and run it from there. – asgoth Dec 26 '12 at 21:12

xpup

XML

xpup is a command-line XML parsing tool written in Go. For example:

$ curl -sL https://www.w3schools.com/xml/note.xml | xpup '/*/body'
Don't forget me this weekend!

or:

$ xpup '/note/from' < <(curl -sL https://www.w3schools.com/xml/note.xml)
Jani

HTML

Here is an example of parsing an HTML page:

$ xpup '/*/head/title' < <(curl -sL https://example.com/)
Example Domain

pup

For HTML parsing, try pup. For example:

$ pup 'title text{}' -f <(curl -sL https://example.com/)
Example Domain
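
pup reads standard input as well, so the result can go straight into a shell variable, in the spirit of the original question:

TITLE="$(curl -sL https://example.com/ | pup 'title text{}')"
echo "$TITLE"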

See related Feature Request for XPath.

Installation

Install pup with: go get github.com/ericchiang/pup.

Kleber Noel
kenorb

HTML-XML-utils

There are many command-line tools in the HTML-XML-utils package that can parse HTML files (e.g. hxselect to match a CSS selector).

There is also xpath, a command-line wrapper around Perl's XPath library (XML::XPath).
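
A quick sketch of hxselect: real-world HTML usually needs to be tidied with hxnormalize -x first, because hxselect expects well-formed input.

TITLE="$(curl -sL https://example.com/ | hxnormalize -x 2>/dev/null | hxselect -c 'title')"
echo "$TITLE"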

Related: Command line tool to query HTML elements at SU

kenorb