21

I need to find all places in a bunch of HTML files that match the following structure (CSS):

div.a ul.b

or XPath:

//div[@class="a"]//ul[@class="b"]

grep doesn't help me here. Is there a command-line tool that returns all files (and optionally all places therein) that match this criterion? I.e., one that returns file names if the file matches a certain HTML or XML structure.
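For example, a file containing a fragment like the following (a made-up illustration) should be reported:

<div class="a">
  <h2>Some heading</h2>
  <ul class="b">
    <li>This file should turn up in the results.</li>
  </ul>
</div>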

Boldewyn
  • You might be able to get fancy with sed and come up with some regex to strip out the elements you don't care about, but that is probably going to be complicated and not reusable unless you write it down somewhere. I would just write a Perl script that uses something like XML::Twig::XPath and prints a message with the file name for all XML files with the class attributes you're looking for. If you're interested, I could post a quick script as an answer; but since you're specifically asking for a command-line solution I'll hold off on that. – Dave Sep 07 '11 at 17:00
  • Similar question: http://superuser.com/questions/507344/command-line-tool-to-query-html-elements-linux – Smit Johnth Oct 15 '14 at 15:55

4 Answers

27

Try this:

  1. Install HTML-XML-utils (http://www.w3.org/Tools/HTML-XML-utils/).
    • Ubuntu: aptitude install html-xml-utils
    • macOS: brew install html-xml-utils
  2. Save a web page (call it filename.html).
  3. Run: hxnormalize -l 240 -x filename.html | hxselect -s '\n' -c "label.black"

Where "label.black" is the CSS selector that uniquely identifies the name of the HTML element. Write a helper script named cssgrep:

#!/bin/bash

# Normalize the HTML (suppressing parse warnings), then print the
# content of every element matching the given CSS selector.
hxnormalize -l 240 -x "$1" 2>/dev/null | hxselect -s '\n' -c "$2"

You can then run:

cssgrep filename.html "label.black"

This prints the content of every HTML label element with class black.

The -l 240 argument is important: it tells hxnormalize to wrap output at column 240, so content that spans a line break in the source is reflowed onto a single line. For example, given the input <label class="black">Text to \nextract</label>, hxnormalize produces <label class="black">Text to extract</label>, which makes the output much easier to parse. Extending the width to 1024 or beyond is also possible.
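The question asks for matching file names rather than extracted content. Here is a minimal sketch of a wrapper that prints the names of matching files (the name cssgrep-files and the loop are my additions; only the hxnormalize/hxselect invocation comes from the answer above):

#!/bin/bash

# cssgrep-files SELECTOR FILE...
# Print the name of each file containing at least one match for SELECTOR.
selector="$1"
shift
for f in "$@"; do
    # A file matches if hxselect produces any output for it.
    if [ -n "$(hxnormalize -l 240 -x "$f" 2>/dev/null | hxselect -s '\n' -c "$selector")" ]; then
        echo "$f"
    fi
done

For the structure from the question, run: cssgrep-files "div.a ul.b" *.html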


Dave Jarvis
9

I have built a command-line tool with Node.js that does just this. You give it a CSS selector and it searches through all of the HTML files in the directory, reporting which files contain matches for that selector.

You will need to install Element Finder, cd into the directory you want to search, and then run:

elfinder -s "div.a ul.b"

For more information, see http://keegan.st/2012/06/03/find-in-files-with-css-selectors/

Keegan Street
6

There are two tools:

  • pup - Inspired by jq, pup aims to be a fast and flexible way of exploring HTML from the terminal.

  • htmlq - Like jq, but for HTML. Uses CSS selectors to extract bits of content from HTML files.

Examples:

$ wget http://en.wikipedia.org/wiki/Robots_exclusion_standard -O robots.html

$ pup --color 'title' < robots.html
<title>
 Robots exclusion standard - Wikipedia
</title>

$ htmlq --text 'title' < robots.html
Robots exclusion standard - Wikipedia
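
Either tool can also answer the original question of which files match div.a ul.b; a minimal sketch with pup (the shell loop is my addition, not part of the tool):

$ for f in *.html; do [ -n "$(pup 'div.a ul.b' < "$f")" ] && echo "$f"; done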
kev
-1

Per Nat's answer here:

How to parse XML in Bash?

Command-line tools that can be called from shell scripts include:

  • 4xpath - a command-line wrapper around Python's 4Suite package
  • XMLStarlet
  • xpath - a command-line wrapper around Perl's XPath library
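
As a sketch of how one of these applies to the structure from the question (assuming well-formed XML/XHTML input; the loop and sort -u are my additions), XMLStarlet can print the name of each file that contains a match:

for f in *.html; do
    # -f prints the current input file name once per match; -n adds a newline.
    xmlstarlet sel -t -m '//div[@class="a"]//ul[@class="b"]' -f -n "$f" 2>/dev/null
done | sort -u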
Dave
  • OK, that's a good way to handle XML. It seems the synopsis code here: http://search.cpan.org/~msergeant/XML-XPath-1.13/XPath.pm would fit my needs exactly. However, if I have non-XML HTML (e.g., some SSI snippets to search), I also need a non-XML tool. Any ideas? – Boldewyn Sep 08 '11 at 07:03
  • In terms of SSI, you should be able to use xpath, since SSI directives are basically XML comments parsed and handled by your server. http://stackoverflow.com/questions/784745/accessing-comments-in-xml-using-xpath – Dave Sep 08 '11 at 15:35
  • Pretty much any variation of HTML should work, and you should be able to access any of the information in it using XPath, as long as it's well formed (libraries that clean up malformed HTML can mitigate this) and the content isn't inside a CDATA section (which XPath can't reach, since it isn't handled as markup). – Dave Sep 08 '11 at 15:37