230

Is there a package for Ubuntu and/or CentOS that provides a command-line tool able to execute an XPath one-liner like foo //element/@attribute filename.xml or foo //element/@attribute < filename.xml and return the results line by line?

I'm looking for something that I can just apt-get install foo or yum install foo and that then works out of the box, with no wrappers or other adaptation necessary.

Here are some examples of things that come close:

Nokogiri. If I write this wrapper I could call the wrapper in the way described above:

#!/usr/bin/ruby

require 'nokogiri'

Nokogiri::XML(STDIN).xpath(ARGV[0]).each do |row|
  puts row
end

XML::XPath. Would work with this wrapper:

#!/usr/bin/perl

use strict;
use warnings;
use XML::XPath;

my $root = XML::XPath->new(ioref => 'STDIN');
for my $node ($root->find($ARGV[0])->get_nodelist) {
  print($node->getData, "\n");
}

xpath from XML::XPath returns too much noise, -- NODE -- and attribute = "value".

xml_grep from XML::Twig cannot handle expressions that do not return elements, so cannot be used to extract attribute values without further processing.

EDIT:

echo cat //element/@attribute | xmllint --shell filename.xml returns noise similar to xpath.

xmllint --xpath //element/@attribute filename.xml returns attribute = "value".

xmllint --xpath 'string(//element/@attribute)' filename.xml returns what I want, but only for the first match.

For another solution almost satisfying the question, here is an XSLT that can be used to evaluate arbitrary XPath expressions (requires dyn:evaluate support in the XSLT processor):

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"
    xmlns:dyn="http://exslt.org/dynamic" extension-element-prefixes="dyn">
  <xsl:output omit-xml-declaration="yes" indent="no" method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="dyn:evaluate($pattern)">
      <xsl:value-of select="dyn:evaluate($value)"/>
      <xsl:value-of select="'&#10;'"/>
    </xsl:for-each> 
  </xsl:template>
</xsl:stylesheet>

Run with xsltproc --stringparam pattern //element/@attribute --stringparam value . arbitrary-xpath.xslt filename.xml.

Mike
clacke

18 Answers

329

You should try these tools:

  • xidel: XPath 3
  • xmlstarlet: can edit, select, transform... Not installed by default. XPath 1
  • xmllint (man xmllint): often installed by default with libxml2-utils. XPath 1 (check my wrapper to get the --xpath switch on very old releases and newline-delimited output on versions older than 2.9.9). Can be used as an interactive shell with the --shell switch.
  • xpath: installed via Perl's module XML::XPath. XPath 1
  • xml_grep: installed via Perl's module XML::Twig. XPath 1 (limited XPath usage)
  • saxon-lint: my own project, a wrapper over @Michael Kay's Saxon-HE Java library (Saxon-HE 9.6). XPath 3.x (plus backward compatibility)

Examples:

xmllint --xpath '//element/@attribute' file.xml
xmlstarlet sel -t -v "//element/@attribute" file.xml
xpath -q -e '//element/@attribute' file.xml
xidel -se '//element/@attribute' file.xml
saxon-lint --xpath '//element/@attribute' file.xml
Gilles Quénot
  • 8
    Excellent! `xmlstarlet sel -T -t -m '//element/@attribute' -v '.' -n filename.xml` does exactly what I want! – clacke Mar 17 '13 at 14:49
  • 2
    Note: xmlstarlet was rumored to be abandoned, but is now under active development again. – clacke Mar 25 '13 at 10:20
  • 6
    Note: Some older versions of `xmllint` do not support the command-line argument `--xpath`, but most seem to support `--shell`. Slightly dirtier output, but still useful in a bind. – kevinarpe May 13 '15 at 10:56
  • I still seem to have trouble querying for node contents rather than an attribute. Can anyone provide an example for that? For some reason I still find xmlstarlet difficult to figure out and get right, between matching, value, and root, just to view the document structure, etc. Even with the first `sel -t -m ... -v ...` example from this page: http://arstechnica.com/information-technology/2005/11/linux-20051115/2/, matching all but the last node and saving that one for the value expression like my use case, I still can't seem to get it; I just get blank output. – Pysis Nov 03 '16 at 03:04
  • nice one on the version of xpath - I'd just run into this limitation of the otherwise excellent xmllint – JonnyRaa Nov 24 '17 at 10:24
  • On my Linux Mint machine (a derivative of Ubuntu/Debian), `xmllint` doesn't come with `libxml2` but with `libxml2-utils`. – toon81 Jan 20 '19 at 19:01
  • For `xidel`, download [xidel-0.9.8.linux64.tar.gz](https://sourceforge.net/projects/videlibri/files/Xidel/Xidel%200.9.8/xidel-0.9.8.linux64.tar.gz/download), enter `./install.sh` – Ivan Chau Apr 26 '19 at 06:57
  • Please look also at https://stackoverflow.com/questions/41114695/get-pom-xml-version-with-xmllint/41115011#41115011 if you wish use `xmllint` on documents with namespaces – Hubbitus Sep 06 '19 at 13:24
  • I tried both xmllint and xmlstarlet, but only xmlstarlet was able to handle my namespace issue cleanly. As a bonus, I was able to concatenate string literals and multiple element and attribute values to produce a CSV file with a one-line xmlstarlet command. xmlstarlet is still the winner for me even though it hasn't been updated in years. – shoover Nov 11 '22 at 19:22
29

You can also try my Xidel. It is not in a package in the repository, but you can just download it from the webpage (it has no dependencies).

It has simple syntax for this task:

xidel filename.xml -e '//element/@attribute' 

It is also one of the few tools here that support XPath 2.

Gilles Quénot
BeniBela
  • 5
    Xidel looks pretty cool, though you should probably mention that you are the also the author of this tool that you recommend. – FrustratedWithFormsDesigner Jul 20 '16 at 17:38
  • 1
    Saxon and saxon-lint use xpath3 ;) – Gilles Quénot Sep 25 '16 at 18:11
  • Xidel (0..8.win32.zip) shows up as having malware on Virustotal. So try at your own risk https://www.virustotal.com/#/file/96854c2be1e3755f56fabb8f00d1fe567108461b9fab139039219a1b7c17e382/detection – JGFMK May 09 '18 at 13:17
  • great - I am going to add xidel to my personal wrench tool box – maoizm Nov 12 '18 at 12:01
  • Nice! I had to run a recursive search for XML files with node(s) matching a given xpath query. Used xidel with find like so: `find . -name "*.xml" -printf '%p : ' -exec xidel {} -s -e 'expr' \;` – Vasan Aug 14 '20 at 07:12
  • @Vasan With a lot of xml-files running `xidel` for each and every xml-file is very inefficient! With the [EXPath File Module](http://www.benibela.de/documentation/internettools/xpath-functions.html#modulefile) `xidel` can do that much faster: `xidel -se 'file:list(.,true(),"*.xml") ! concat(.," : ",doc(.)/{expr})'` – Reino Oct 02 '20 at 11:20
19

One package that is very likely to be installed on a system already is python-lxml. If so, this is possible without installing any extra package:

python -c "from lxml.etree import parse; from sys import stdin; print('\n'.join(parse(stdin).xpath('//element/@attribute')))"
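
Where lxml is not installed, a rough equivalent can be sketched with the standard library alone. ElementTree's path language cannot select attribute nodes directly, so this sketch selects the elements first and then reads the attribute from each one. The function name `attribute_values` is made up for illustration and is not part of any library:

```python
import xml.etree.ElementTree as ET


def attribute_values(xml_text, element_path, attribute):
    """Return the given attribute of every element matching element_path."""
    root = ET.fromstring(xml_text)
    # ElementTree supports only a subset of XPath and has no '/@attr' step,
    # so read the attribute from each matching element instead.
    return [elem.get(attribute)
            for elem in root.iterfind(element_path)
            if elem.get(attribute) is not None]


if __name__ == "__main__":
    # Hypothetical sample document, used only to demonstrate the call.
    sample = ('<root><element attribute="a"/>'
              '<x><element attribute="b"/></x><element/></root>')
    print("\n".join(attribute_values(sample, ".//element", "attribute")))
```

Elements that lack the attribute are skipped, which mirrors what `//element/@attribute` would select.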
Heath Borders
clacke
  • 1
    How to pass filename? – Ramakrishnan Kannan Jul 23 '16 at 12:37
  • 6
    This works on `stdin`. That eliminates the need for including `open()` and `close()` in an already quite lengthy one-liner. To parse a file just run `python -c "from lxml.etree import parse; from sys import stdin; print '\n'.join(parse(stdin).xpath('//element/@attribute'))" < my_file.xml` and let your shell handle the file lookup, opening and closing. – clacke Jul 28 '16 at 11:26
11

In my search for a way to query Maven pom.xml files I ran across this question. However, I had the following limitations:

  • must run cross-platform.
  • must exist on all major linux distributions without any additional module installation
  • must handle complex xml-files such as maven pom.xml files
  • simple syntax

I have tried many of the above without success:

  • python lxml.etree is not part of the standard Python distribution
  • python xml.etree is, but it does not handle complex Maven pom.xml files well, for a reason I have not dug deep enough to find
  • xmllint does not work either; it often core dumps on Ubuntu 12.04 ("xmllint: using libxml version 20708")

The solution I came across that is stable, short, mature, and works on many platforms is the REXML library built into Ruby:

ruby -r rexml/document -e 'include REXML; 
     puts XPath.first(Document.new($stdin), "/project/version/text()")' < pom.xml

What inspired me to find this one was the following articles:

Mike
  • 1
    That's even narrower criteria than the question, so it definitely fits as an answer. I'm sure many people who ran into your situation will be helped by your research. I'm keeping `xmlstarlet` as the accepted answer, because it fits my wider criteria and it's *really neat*. But I will probably have use for your solution from time to time. – clacke May 14 '14 at 14:59
  • 2
    I would add that to **avoid quotes around the result**, use `puts` instead of `p` in the Ruby command. – tooomg Jul 03 '15 at 09:07
10

Saxon will do this not only for XPath 2.0, but also for XQuery 1.0 and (in the commercial version) 3.0. It doesn't come as a Linux package, but as a jar file. Syntax (which you can easily wrap in a simple script) is

java net.sf.saxon.Query -s:source.xml -qs://element/attribute

2020 UPDATE

Saxon 10.0 includes the Gizmo tool, which can be used interactively or in batch from the command line. For example

java net.sf.saxon.Gizmo -s:source.xml
/>show //element/@attribute
/>quit
Michael Kay
  • SaxonB is in Ubuntu, package `libsaxonb-java`, but if I run `saxonb-xquery -qs://element/@attribute -s:filename.xml` I get `SENR0001: Cannot serialize a free-standing attribute node`, same problem as with e.g. `xml_grep`. – clacke Mar 25 '13 at 10:18
  • 3
    If you want to see full details of the attribute node selected by this query, use the -wrap option on the command line. If you just want the string value of the attribute, add /string() to the query. – Michael Kay Mar 26 '13 at 18:25
  • Thanks. Adding /string() gets closer. But it outputs an XML header and puts all the results on one row, so still no cigar. – clacke Mar 27 '13 at 10:30
  • 2
    If you don't want an XML header, add the option !method=text. – Michael Kay Mar 29 '13 at 22:07
  • To use namespace add it to `-qs` like this: `'-qs:declare namespace mets="http://www.loc.gov/METS/";/mets:mets/mets:dmdSec'` – igo Aug 24 '16 at 12:26
  • You might want to specify the classpath, e.g. like `java -classpath /usr/share/java/saxonb.jar net.sf.saxon.Query -s:source.xml -qs://element/@attribute`. – Thomas W Jun 03 '20 at 05:36
5

You might also be interested in xsh. It features an interactive mode where you can do whatever you like with the document:

open 1.xml ;
ls //element/@id ;
for //p[@class="first"] echo text() ;
choroba
5

clacke’s answer is great but I think only works if your source is well-formed XML, not normal HTML.

So to do the same for normal Web content—HTML docs that aren’t necessarily well-formed XML:

echo "<p>foo<div>bar</div><p>baz" | python -c "from sys import stdin; \
from lxml import html; \
print('\n'.join(html.tostring(node, encoding='unicode') for node in html.parse(stdin).xpath('//p')))"

To use html5lib instead (to ensure you get the same parsing behavior as Web browsers, because like browser parsers, html5lib conforms to the parsing requirements in the HTML spec):

echo "<p>foo<div>bar</div><p>baz" | python -c "from sys import stdin; \
import html5lib; from lxml import html; \
doc = html5lib.parse(stdin, treebuilder='lxml', namespaceHTMLElements=False); \
print('\n'.join(html.tostring(node, encoding='unicode') for node in doc.xpath('//p')))"
Community
sideshowbarker
  • Yes, I fell for my own assumption in the question, that XPath implies XML. This answer is a good complement to the others here, and thanks for letting me know about html5lib! – clacke Feb 18 '16 at 04:57
4

Similar to Mike's and clacke's answers, here is the python one-liner (using python >= 2.5) to get the build version from a pom.xml file that gets around the fact that pom.xml files don't normally have a dtd or default namespace, so don't appear well-formed to libxml:

python -c "import xml.etree.ElementTree as ET; \
  print(ET.parse(open('pom.xml')).getroot().find( \
  '{http://maven.apache.org/POM/4.0.0}version').text)"

Tested on Mac and Linux, and doesn't require any extra packages to be installed.
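
The same lookup can be written with a prefix map instead of spelling out the `{namespace}tag` form. This is a minimal sketch, assuming the POM declares the usual `http://maven.apache.org/POM/4.0.0` default namespace. The function name `pom_version` and the prefix `m` are arbitrary choices for this example:

```python
import xml.etree.ElementTree as ET

# Prefix chosen for this example only; it need not appear in the POM itself.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def pom_version(pom_text):
    """Return the text of the top-level <version> element, or None."""
    root = ET.fromstring(pom_text)
    # 'm:version' is resolved through POM_NS to the namespaced tag name.
    return root.findtext("m:version", namespaces=POM_NS)


if __name__ == "__main__":
    # Hypothetical minimal POM, used only to demonstrate the call.
    sample = (
        '<project xmlns="http://maven.apache.org/POM/4.0.0">'
        "<modelVersion>4.0.0</modelVersion>"
        "<version>1.2.3</version>"
        "</project>"
    )
    print(pom_version(sample))
```

Returning None when the element is missing avoids the AttributeError that `.find(...).text` raises in that case.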

pdr
  • 2
    I used this today! Our build servers had neither `lxml` nor `xmllint`, or even Ruby. In the spirit of the format in [my own answer](https://stackoverflow.com/a/15471368/260122), I wrote it as `python3 -c "from xml.etree.ElementTree import parse; from sys import stdin; print(parse(stdin).find('.//element[subelement=\"value\"]/othersubelement').text)" <<< "$variable_containing_xml"` in bash. `.getroot()` doesn't seem necessary. – clacke Jan 30 '18 at 04:17
3

A minimal wrapper for python's lxml module that will print all matching nodes by name (at any level), e.g. mysubnode or an XPath subset e.g. //intermediarynode/subnode. If the expression evaluates to text then text will be printed, if it evaluates to an element then the entire raw element will be rendered to text. It also attempts to handle XML namespaces in a way that allows using local tag names without prefixing. With extended XPath mode enabled via the -x flag the default namespace needs to be referenced with the p: prefix, e.g. //p:tagname/p:subtag

#!/usr/bin/env python3
import argparse
import os
import sys

from lxml import etree

DEFAULT_NAMESPACE_KEY = 'p'

def print_element(elem):
    if isinstance(elem, str):
        print(elem)
    elif isinstance(elem, bytes):
        print(elem.decode('utf-8'))
    else:
        print(elem.text and elem.text.strip() or etree.tostring(elem, encoding='unicode', pretty_print=True))


if __name__ == '__main__':

    parser = argparse.ArgumentParser(description='XPATH lxml wrapper',
                                     usage="""
    Print all nodes by name in XML file:                                     
    \t{0} myfile.xml somename

    Print all nodes by XPath selector (findall: reduced subset):                                     
    \t{0} myfile.xml //intermediarynode/childnode

    Print attribute values by XPath selector 'p' maps to default namespace (xpath 1.0: extended subset):                                     
    \t{0} myfile.xml //p:intermediarynode/p:childnode/@src -x
                          
     """.format(os.path.basename(sys.argv[0])))
    parser.add_argument('xpath_file',
                        help='XML file path')
    parser.add_argument('xpath_expression',
                        help='tag name or xpath expression')
    parser.add_argument('--force_xpath', '-x',
                        action='store_true',
                        default=False,
                        help='Use lxml.xpath (rather than findall)'
    )

    args = parser.parse_args(sys.argv[1:])
    xpath_expression = args.xpath_expression

    tree = etree.parse(args.xpath_file)

    ns = tree.getroot().nsmap

    if args.force_xpath:
        if ns.keys() and None in ns:
            ns[DEFAULT_NAMESPACE_KEY] = ns.pop(None)
        for node in tree.xpath(xpath_expression, namespaces=ns):
            print_element(node)

    elif xpath_expression.isalpha():
        for node in tree.xpath(f"//*[local-name() = '{xpath_expression}']"):
            print_element(node)
    else:
        for el in tree.findall(xpath_expression, namespaces=ns):
            print_element(el)


It uses lxml — a fast XML parser written in C which is not included in the standard python library. Install it with pip install lxml. On Linux/OSX might need prefixing with sudo.

Usage:

python3 xmlcat.py file.xml "//mynode"

lxml can also accept a URL as input:

python3 xmlcat.py http://example.com/file.xml "//mynode" 

Extract the url attribute under an enclosure node, i.e. <enclosure url="http:..." ...> (-x forces an extended XPath 1.0 subset):

python3 xmlcat.py file.xml "//enclosure/@url" -x

Xpath in Google Chrome

As an unrelated side note: If by chance you want to run an XPath expression against the markup of a web page then you can do it straight from the Chrome devtools: right-click the page in Chrome > select Inspect, and then in the DevTools console paste your XPath expression as $x("//spam/eggs").

Example: get all authors on this page:

$x("//*[@class='user-details']/a/text()")
ccpizza
  • Not a one-liner, and `lxml` was already mentioned in [two](https://stackoverflow.com/a/15471368/260122) other [answers](https://stackoverflow.com/a/35446446/260122) years before yours. – clacke Jan 30 '18 at 04:25
2

In addition to XML::XSH and XML::XSH2 there are some grep-like utilities such as App::xml_grep2 and XML::Twig (which includes xml_grep rather than xml_grep2). These can be quite useful when working on large or numerous XML files, for quick one-liners or Makefile targets. XML::Twig is especially nice to work with for a Perl scripting approach when you want to do a bit more processing than your $SHELL, xmllint, and xsltproc offer.

The numbering scheme in the application names indicates that the "2" versions are newer/later version of essentially the same tool which may require later versions of other modules (or of perl itself).

G. Cito
  • `xml_grep2 -t //element@attribute filename.xml` works and does what I expect it to (`xml_grep --root //element@attribute --text_only filename.xml` still doesn't, returns an "unrecognized expression" error). Great! – clacke Mar 07 '14 at 13:57
  • What about ```xml_grep --pretty_print --root '//element[@attribute]' --text_only filename.xml```? Not sure what is going on there or what XPath says about ```[]``` in this case, but surrounding an ```@attribute``` with square brackets works for ```xml_grep``` and ```xml_grep2```. – G. Cito Mar 07 '14 at 14:33
  • I mean `//element/@attribute`, not `//element@attribute`. Can't edit it apparently, but leaving it there rather than delete+replace to not confuse the history of this discussion. – clacke Mar 19 '14 at 14:48
  • `//element[@attribute]` selects elements of type `element` that have an attribute `attribute`. I do not want the element, only the attribute. `` should give me `foo`, not the full ``. – clacke Mar 19 '14 at 14:51
  • ... and `--text_only` in that context gives me the empty string in the case of an element like `` with no text node inside. – clacke Mar 19 '14 at 14:53
  • Minor correction "Xml" instead of "xml" : `sudo cpan App::Xml_grep2` – JJoao Dec 22 '16 at 09:06
2

It bears mentioning that nokogiri itself ships with a command line tool, which should be installed with gem install nokogiri.

You might find this blog post useful.

Geoff Nixon
2

Here's one xmlstarlet use case to extract data from nested elements elem1, elem2 to one line of text from this type of XML (also showing how to handle namespaces):

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<mydoctype xmlns="http://xml-namespace-uri" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://xml-namespace-uri http://xsd-uri" format="20171221A" date="2018-05-15">

  <elem1 time="0.586" length="10.586">
      <elem2 value="cue-in" type="outro" />
  </elem1>

</mydoctype>

The output will be

0.586 11.172 cue-in outro

In this snippet, -m matches the nested elem2, -v outputs attribute values (with expressions and relative addressing), -o literal text, -n adds a newline:

xml sel -N ns="http://xml-namespace-uri" -t -m '//ns:elem1/ns:elem2' \
 -v ../@time -o " " -v '../@time + ../@length' -o " " -v @value -o " " -v @type -n file.xml

If more attributes are needed from elem1, one can do it like this (also showing the concat() function):

xml sel -N ns="http://xml-namespace-uri" -t -m '//ns:elem1/ns:elem2/..' \
 -v 'concat(@time, " ", @time + @length, " ", ns:elem2/@value, " ", ns:elem2/@type)' -n file.xml

Note the (IMO unnecessary) complication with namespaces (ns, declared with -N), that had me almost giving up on xpath and xmlstarlet, and writing a quick ad-hoc converter.
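
For comparison, roughly the same extraction can be sketched in Python with the standard library. This is only an illustration and not part of xmlstarlet. The function name `cue_lines` and the prefix `ns` are assumptions made for the example, and the sum is formatted with `:g` to keep floating-point noise out of the output:

```python
import xml.etree.ElementTree as ET

# Prefix chosen for this example; it maps to the document's default namespace.
NS = {"ns": "http://xml-namespace-uri"}


def cue_lines(xml_text):
    """Return one 'time time+length value type' line per elem1/elem2 pair."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem1 in root.iterfind(".//ns:elem1", NS):
        time = float(elem1.get("time"))
        length = float(elem1.get("length"))
        # Attributes carry no namespace here, so plain names work for them.
        for elem2 in elem1.iterfind("ns:elem2", NS):
            lines.append("{} {:g} {} {}".format(
                elem1.get("time"), time + length,
                elem2.get("value"), elem2.get("type")))
    return lines
```

As with xmlstarlet, element names need the namespace mapping while the attribute names do not.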

diemo
  • xmlstarlet is great, but the accepted and main ranking answer already mentions it. The information about how to handle namespaces might have been relevant as a comment, if at all. Anyone running into issues with namespaces and xmlstarlet can find an excellent [discussion in the documentation](http://xmlstar.sourceforge.net/doc/UG/xmlstarlet-ug.html#idm47077139530992) – clacke May 20 '18 at 15:49
  • 2
    Sure, @clacke, xmlstarlet has been mentioned several times, but also that it is hard to grasp, and underdocumented. I was guessing around for an hour how to get information out of nested elements. I wish I had had that example, that's why I am posting it here to avoid others that loss of time (and the example is too long for a comment). – diemo May 21 '18 at 20:53
2

My Python script xgrep.py does exactly this. In order to search for all attributes attribute of elements element in files filename.xml ..., you would run it as follows:

xgrep.py "//element/@attribute" filename.xml ...

There are various switches for controlling the output, such as -c for counting matches, -i for indenting the matching parts, and -l for outputting filenames only.

The script is not available as a Debian or Ubuntu package, but all of its dependencies are.

2

Install the BaseX database, then use its "standalone command-line mode" like this:

basex -i - //element/@attribute < filename.xml

or

basex -i filename.xml //element/@attribute

The query language is actually XQuery (3.0), not XPath, but since XQuery is a superset of XPath, you can use XPath queries without ever noticing.

igneus
1

This project is apparently fairly new, but check out https://github.com/jeffbr13/xq. It seems to be a wrapper around lxml, but that is all you really need (other answers have posted ad hoc solutions using lxml as well).

mgrandi
1

I wasn't happy with Python one-liners for HTML XPath queries, so I wrote my own. Assumes that you installed python-lxml package or ran pip install --user lxml:

function htmlxpath() { python -c 'for x in __import__("lxml.html").html.fromstring(__import__("sys").stdin.read()).xpath(__import__("sys").argv[1]): print(x)' "$1"; }

Once you have it, you can use it like in this example:

> curl -s https://slashdot.org | htmlxpath '//title/text()'
Slashdot: News for nerds, stuff that matters
d33tah
0

Sorry to be yet another voice in the fray. I tried all the tools in this thread and found none of them to be satisfactory for my needs, so I wrote my own. You can find it here: https://github.com/charmparticle/xpe

It's been uploaded to pypi, so you can easily install it with pip3 like so:

sudo pip3 install xpe

Once installed, you can use it to run xpath expressions against various kinds of input with the same level of flexibility you would get from using xpaths in selenium or javascript. Yeah, you can use xpaths against HTML with this.

0

A solution that works even when namespace declarations exist on top:

Most of the commands proposed in the answers do not work out of the box if the xml has a namespace declared on top. Consider this:

input xml:

<elem1 xmlns="urn:x" xmlns:prefix="urn:y">
    <elem2 attr1="false" attr2="value2">
        elem2 value
    </elem2>
    <elem2 attr1="true" attr2="value2.1">
        elem2.1 value
    </elem2>    
    <prefix:elem3>
        elem3 value
    </prefix:elem3>        
</elem1>

Does not work:

xmlstarlet sel -t -v "/elem1" input.xml
# nothing printed
xmllint -xpath "/elem1" input.xml
# XPath set is empty
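
The underlying cause is that an unprefixed name in XPath 1.0 refers to an element in no namespace, while elem1 here lives in urn:x. A small sketch with Python's standard library shows the same effect and two ways around it. It assumes Python 3.8 or later for the `{*}` wildcard, and the helper name `count_matches` is invented for this example:

```python
import xml.etree.ElementTree as ET

# Condensed copy of the input document above.
DOC = """<elem1 xmlns="urn:x" xmlns:prefix="urn:y">
    <elem2 attr1="false" attr2="value2">elem2 value</elem2>
    <elem2 attr1="true" attr2="value2.1">elem2.1 value</elem2>
    <prefix:elem3>elem3 value</prefix:elem3>
</elem1>"""


def count_matches(path, namespaces=None):
    """Count the elements below the root that match an ElementTree path."""
    root = ET.fromstring(DOC)
    return len(root.findall(path, namespaces))


if __name__ == "__main__":
    print(count_matches("elem2"))                    # unqualified name
    print(count_matches("x:elem2", {"x": "urn:x"}))  # explicit prefix mapping
    print(count_matches("{*}elem2"))                 # any namespace (3.8+)
    print(count_matches("{*}elem3"))                 # any namespace (3.8+)
```

The unqualified query finds nothing, just like the xmlstarlet and xmllint commands above, while the mapped prefix and the wildcard both reach the namespaced elements.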

Solution:

# Requires >=java11 to run like below (but the code requires >=java17 for case syntax to be recognized)

# Prints the whole document
java ExtractXpath.java "/" input.xml

# Prints the contents and self of "elem1"
java ExtractXpath.java "/elem1" input.xml

# Prints the contents and self of "elem2" whose attr2 value is: 'value2'
java ExtractXpath.java "//elem2[@attr2='value2']" input.xml

# Prints the value of the attribute 'attr2': "value2", "value2.1"
java ExtractXpath.java "/elem1/elem2/@attr2" input.xml

# Prints the text inside elem3: "elem3 value"
java ExtractXpath.java "/elem1/elem3/text()" input.xml

# Prints the name of the matched element: "prefix:elem3"
java ExtractXpath.java "name(/elem1/elem3)" input.xml
# Same as above: "prefix:elem3"
java ExtractXpath.java "name(*/elem3)" input.xml

# Prints the count of the matched elements: 2.0
java ExtractXpath.java "count(//elem2)" input.xml


# known issue: while "//elem2" works. "//elem3" does not (it works only with: '*/elem3' )


ExtractXpath.java:


import java.io.File;
import java.io.FileInputStream;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.Collectors;

import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathEvaluationResult;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ExtractXpath {

    public static void main(String[] args) throws Exception {
        assertThat(args.length==2, "Wrong number of args");
        String xpath = args[0];
        File file = new File(args[1]);
             
        assertThat(file.isFile(), file.getAbsolutePath()+" is not a file.");
        FileInputStream fileIS = new FileInputStream(file);
        DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = builderFactory.newDocumentBuilder();
        Document xmlDocument = builder.parse(fileIS);
        XPath xPath = XPathFactory.newInstance().newXPath();
        String expression = xpath;
        XPathExpression xpathExpression =  xPath.compile(expression);
        
        XPathEvaluationResult xpathEvalResult =  xpathExpression.evaluateExpression(xmlDocument);
        System.out.println(applyXpathExpression(xmlDocument, xpathExpression, xpathEvalResult.type().name()));
    }

    private static String applyXpathExpression(Document xmlDocument, XPathExpression expr, String xpathTypeName) throws TransformerConfigurationException, TransformerException, XPathExpressionException {

        // see: https://www.w3.org/TR/1999/REC-xpath-19991116/#corelib
        List<String> retVal = new ArrayList<>();
        if(xpathTypeName.equals(XPathConstants.NODESET.getLocalPart())){ //e.g. xpath: /elem1/*
            NodeList nodeList = (NodeList)expr.evaluate(xmlDocument, XPathConstants.NODESET);
            for (int i = 0; i < nodeList.getLength(); i++) {
                retVal.add(convertNodeToString(nodeList.item(i)));
            }
        }else if(xpathTypeName.equals(XPathConstants.STRING.getLocalPart())){ //e.g. xpath: name(/elem1/*)
            retVal.add((String)expr.evaluate(xmlDocument, XPathConstants.STRING));
        }else if(xpathTypeName.equals(XPathConstants.NUMBER.getLocalPart())){ //e.g. xpath: count(/elem1/*)
            retVal.add(((Number)expr.evaluate(xmlDocument, XPathConstants.NUMBER)).toString());
        }else if(xpathTypeName.equals(XPathConstants.BOOLEAN.getLocalPart())){ //e.g. xpath: contains(elem1, 'sth')
            retVal.add(((Boolean)expr.evaluate(xmlDocument, XPathConstants.BOOLEAN)).toString());
        }else if(xpathTypeName.equals(XPathConstants.NODE.getLocalPart())){ //e.g. xpath: fixme: find one
            System.err.println("WARNING found xpathTypeName=NODE");
            retVal.add(convertNodeToString((Node)expr.evaluate(xmlDocument, XPathConstants.NODE)));
        }else{
            throw new RuntimeException("Unexpected xpath type name: "+xpathTypeName+". This should normally not happen");
        }
        return retVal.stream().map(str->"==MATCH_START==\n"+str+"\n==MATCH_END==").collect(Collectors.joining ("\n"));
        
    }
    
    private static String convertNodeToString(Node node) throws TransformerConfigurationException, TransformerException {
        short nType = node.getNodeType();
        switch (nType) {
            case Node.ATTRIBUTE_NODE , Node.TEXT_NODE -> {
                return node.getNodeValue();
            }
            case Node.ELEMENT_NODE, Node.DOCUMENT_NODE -> {
                StringWriter writer = new StringWriter();
                Transformer trans = TransformerFactory.newInstance().newTransformer();
                trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
                trans.setOutputProperty(OutputKeys.INDENT, "yes");
                trans.transform(new DOMSource(node), new StreamResult(writer));
                return writer.toString();
            }
            default -> {
                System.err.println("WARNING: FIXME: Node type:"+nType+" could possibly be handled in a better way.");
                return node.getNodeValue();
            }
                
        }
    }

    
    private static void assertThat(boolean b, String msg) {
        if(!b){
            System.err.println(msg+"\n\nUSAGE: program xpath xmlFile");
            System.exit(-1);
        }
    }
}

@SuppressWarnings("unchecked")
class NamespaceResolver implements NamespaceContext {
    //Store the source document to search the namespaces
    private final Document sourceDocument;
    public NamespaceResolver(Document document) {
        sourceDocument = document;
    }

    //The lookup for the namespace uris is delegated to the stored document.
    @Override
    public String getNamespaceURI(String prefix) {
        if (prefix.equals(XMLConstants.DEFAULT_NS_PREFIX)) {
            return sourceDocument.lookupNamespaceURI(null);
        } else {
            return sourceDocument.lookupNamespaceURI(prefix);
        }
    }

    @Override
    public String getPrefix(String namespaceURI) {
        return sourceDocument.lookupPrefix(namespaceURI);
    }

    @SuppressWarnings("rawtypes")
    @Override
    public Iterator getPrefixes(String namespaceURI) {
        return null;
    }
}

and for simplicity:

xpath-extract command:

#!/bin/bash
java ExtractXpath.java "$1" "$2"

Marinos An