7

I'm trying to write a simple script to check a webpage for a specific value:

$("a#infgHeader").text() == "Delivered";

I'd like to automate this from a Bash script to be run at an interval. I'm also fine with using Python. I need to essentially make an HTTP request, get the response, and have a way to intelligently query the result. Is there a library which will help me with the querying part?

Naftuli Kay

5 Answers

11

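XPath is great for querying HTML.

Something like this:

//a[@id='infgHeader']/text()

Note that `text()` selects the element's text content, whereas `@text` would select an attribute named `text`, as in the terminal example below.

In Chrome's developer tools, you can use the search box in the Elements tab to test the expression.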

Quick run in terminal:

$ echo '<div id="test" text="foo">Hello</div>' | xpath '//div[@id="test"]/@text'
Found 1 nodes:
-- NODE --
 text="foo"
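
Since the question also allows Python, the same kind of query can be sketched with the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. This is only a sketch under assumptions: the `SAMPLE` markup and the helper names `link_text` and `is_delivered` are hypothetical, and the parser only accepts well-formed XML/XHTML, so it will raise an error on sloppier real-world HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup standing in for the real page.
SAMPLE = '<html><body><a id="infgHeader" href="#">Delivered</a></body></html>'


def link_text(markup, element_id="infgHeader"):
    """Return the stripped text of the <a> with the given id, or None."""
    # ET.fromstring raises xml.etree.ElementTree.ParseError on non-XML input.
    root = ET.fromstring(markup)
    # ElementTree's XPath subset supports [@attr='value'] predicates.
    node = root.find(f".//a[@id='{element_id}']")
    if node is None or node.text is None:
        return None
    return node.text.strip()


def is_delivered(markup):
    """Mirror the jQuery check: $("a#infgHeader").text() == "Delivered"."""
    return link_text(markup) == "Delivered"


print(is_delivered(SAMPLE))
```

ElementTree returns the element rather than a text node, so the text is read from `node.text` instead of ending the expression in `/text()`.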
ebaxt
  • Hooray for xPath! I was wondering if it would be of help here. I didn't know because HTML != XML, but hey, if it works, it works. – Naftuli Kay Feb 29 '12 at 19:33
  • 1
    `xpath` works poorly with not-strictly-XML HTML code. When running it on a 100-line HTML snippet, it freezes for a minute then dies with a "mismatched tag" error, apparently because the code had `` and not ``. – Tgr Mar 23 '16 at 14:48
  • Yes, xpath is not reliable. Guess I'll use [regular expressions to parse HTML](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) then. – phil294 Mar 17 '18 at 10:31
2

http://pypi.python.org/pypi/spynner/1.10

Spynner will let you select elements from the DOM using jQuery syntax.

There are also other libraries that let you parse HTML, such as BeautifulSoup and lxml.
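
BeautifulSoup and lxml are third-party packages. As a dependency-free sketch of the same idea (fetch the page, then look up the link by id), the version below swaps in the standard library's `html.parser`, which tolerates markup that is not well-formed. The class `LinkTextFinder`, the helpers `anchor_text` and `page_is_delivered`, and the `SAMPLE` markup are all hypothetical names chosen for illustration, and the URL you pass in is whatever page you are monitoring.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkTextFinder(HTMLParser):
    """Collects the text inside the first <a> whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self._inside = False  # currently between the matching <a> and </a>
        self._done = False    # a matching link has already been consumed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and not self._done and dict(attrs).get("id") == self.target_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self._inside = False
            self._done = True

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def anchor_text(html, target_id="infgHeader"):
    """Return the stripped text of the matching link, or "" if none is found."""
    finder = LinkTextFinder(target_id)
    finder.feed(html)
    finder.close()
    return "".join(finder.chunks).strip()


def page_is_delivered(url):
    """Fetch url over HTTP and apply the same check as the jQuery one-liner."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return anchor_text(html) == "Delivered"


# Offline check against hypothetical, deliberately sloppy markup (unclosed tags).
SAMPLE = '<p>Status<br><a id="infgHeader" href="#"> Delivered </a>'
print(anchor_text(SAMPLE) == "Delivered")
```

A script like this could be run at an interval from cron or a Bash loop. Like the other parsing approaches, it only sees the HTML the server returns, not content added later by JavaScript.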

dm03514
1

Alex MacCaw wrote up a nice post that does just what you're asking using Node.js / JavaScript. It brings a lot of other capabilities too.

http://alexmaccaw.com/posts/node_jquery_xml_parsing

Joshua
0

I have recently done something like this using Node.js + jsdom. Both are well documented and have a low barrier to entry.

OlduwanSteve
0

Parsing HTML is not trivial for general websites: the HTML may not be perfect, and the DOM can be modified by JavaScript on the fly, so parsing the raw HTML may not make sense in such cases.

The best way is to use a browser and access the DOM directly. For that you can use a headless browser like PhantomJS, which you can script to check whatever you need to check.

Anurag Uniyal