I am trying to process a large number of xml files (maven poms) using xmllint --xpath. With some trial and error I figured out that it does not work as expected due to the bad default namespace declaration in these files, which is as follows:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

A simple command fails as follows:

$ echo $(xmllint --xpath '/project/modelVersion/text()' pom.xml )
XPath set is empty

If I get rid of the xmlns attribute, replacing the root element as follows:

<project xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

The previous command gives the expected output:

$ echo $(xmllint --xpath '/project/modelVersion/text()' pom.xml )
4.0.0

Changing hundreds of pom files is not an option, especially since maven itself does not complain.

Is there a way for xmllint to process the file with the bad xmlns?

UPDATE

Thanks to Damien I was able to make some progress:

$ ( echo setns x=http://maven.apache.org/POM/4.0.0; echo 'xpath /x:project/x:modelVersion/text()'; ) | xmllint --shell pom.xml
/ > setns x=http://maven.apache.org/POM/4.0.0
/ > xpath /x:project/x:modelVersion/text()
Object is a Node Set :
Set contains 1 nodes:
1  TEXT
    content=4.0.0

But this does not quite do what I need. My follow up questions are as follows:

  1. Is there a way to print only the text? I would like the output to contain only 4.0.0 in the above example.

  2. It seems the output gets truncated after about 30 characters. Is it possible to get complete output? This does not happen with xmllint --xpath

Cœur
Miserable Variable
  • It's not a *bad* namespace. It's a namespace. What that usually means is that you also need to use the namespace in your XPath query, but I'm not familiar with the specifics of the tool you're using to tell you how exactly. – Damien_The_Unbeliever Feb 12 '15 at 09:16
  • It's bad because it causes xmllint to fail :) Also because the schemaLocation is wrong. – Miserable Variable Feb 12 '15 at 10:17
  • A bit of simple searching on `xmllint namespace` turned up [this question](http://stackoverflow.com/questions/8264134/xmllint-failing-to-properly-query-with-xpath) which seems to show two possible ways of working *with* the namespace. And the schemalocation appears to be correct. It says that the schema identified by the URI `http://maven.apache.org/POM/4.0.0` can be located at the URL `http://maven.apache.org/maven-v4_0_0.xsd` and that would appear to be true. – Damien_The_Unbeliever Feb 12 '15 at 10:39
  • @Damien_The_Unbeliever thanks a lot for the pointer. I was able to make some progress but haven't been able to completely solve the problem. I will update the question, appreciate if you can respond. Thanks again. – Miserable Variable Feb 24 '15 at 04:19

2 Answers

strip the namespace with sed

given in pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
</project>

this:

cat pom.xml | sed '2 s/xmlns=".*"//g' | xmllint --xpath '/project/modelVersion' -

returns this:

<modelVersion>4.0.0</modelVersion>

if you have funky formatting (like, the xmlns attributes are on their own lines), run it through the formatter first:

cat pom.xml | xmllint --format - | sed '2 s/xmlns=".*"//g' | xmllint --xpath '/project/modelVersion' -
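
One caveat with the pattern above: `.*` is greedy, so when several attributes share a line it can match through to the last closing quote on that line and remove more than the default namespace. A minimal sketch of a narrower match follows, assuming the declaration is preceded by whitespace and uses double quotes; the function name `strip_default_ns` is made up for illustration.

```shell
# Remove only the default namespace declaration (xmlns="..."), leaving
# prefixed declarations such as xmlns:xsi="..." untouched.
# [^"]* stops at the first closing quote, so the match cannot run on
# into later attributes the way a greedy .* can.
# Assumption: the attribute is preceded by whitespace on the same line.
strip_default_ns() {
  sed 's/[[:space:]]xmlns="[^"]*"//'
}

# Hypothetical usage, assuming pom.xml is in the current directory:
#   strip_default_ns < pom.xml | xmllint --xpath '/project/modelVersion/text()' -
```

Because the expression carries no line address, it is applied to every line, so it also strips a default namespace declared on a nested element. That is probably acceptable for typical POM files but worth checking for other XML.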
djeikyb
  • Thanks, just saw this. I do have the xmlns attributes spread over multiple lines, `--format` might be a good solution for that – Miserable Variable May 29 '15 at 22:06
  • Well, that sed removes all attributes ` ` results in ` ` – mau Jun 03 '21 at 11:40
  • @mau ah i see, it looks like the star is behaving greedily. i'm thinking about how best to edit the answer, but adding a question mark after the star solves the problem, and might be a good fix: `cat tmp.xml | xmllint --format - | sed '2 s/xmlns=".*?"//g'` – djeikyb Jun 03 '21 at 19:19
  • @mau (1) nice catch (2) this is getting unwieldy lol.. limiting the expression's avarice goes too far, its newfound austerity leaves us to deal with "xmlns:query". imo this is at the limits of portable sed; perhaps a switch to perl is warranted? [but.](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/) `perl -wp -e 's/(xmlns|xsi)(:.*?)?=".*?"//g'` if i still like this tomorrow i'll edit it in maybe – djeikyb Jun 03 '21 at 21:03
11
xmllint --xpath "/*[local-name() = 'project']/*[local-name() = 'parent']/*[local-name() = 'version']/text()" pom.xml

For a top level pom.xml:

xmllint --xpath "/*[local-name() = 'project']/*[local-name() = 'version']/text()" pom.xml

It ain't real pretty, but it avoids formatting assumptions and/or re-formatting the input pom.xml file.

If you need to strip off the "-SNAPSHOT" for some reason, pipe the result of the above through | sed -e "s|-SNAPSHOT||".
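
Since the question is about processing many POM files, the two steps above could be wrapped in small helpers. This is only a sketch: the function names `pom_version` and `strip_snapshot` are invented here, and it assumes an xmllint build whose `--xpath` option prints the result of a `string()` expression as plain text, which would also address the "print only the text" follow-up.

```shell
# Print the top-level <version> of a POM as plain text.
# local-name() ignores the namespace, and string() returns the text
# value instead of a node set. (Assumes xmllint supports --xpath.)
pom_version() {
  xmllint --xpath "string(/*[local-name()='project']/*[local-name()='version'])" "$1"
}

# Drop a trailing -SNAPSHOT suffix using POSIX parameter expansion.
strip_snapshot() {
  printf '%s\n' "${1%-SNAPSHOT}"
}

# Hypothetical usage over a directory tree:
#   find . -name pom.xml | while IFS= read -r pom; do
#     printf '%s\t%s\n' "$pom" "$(strip_snapshot "$(pom_version "$pom")")"
#   done
```

Unlike the sed substitution, `${1%-SNAPSHOT}` only removes the suffix when it appears at the very end of the value.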

Charlie Reitzel