
I have just learned about xmlstarlet, but unfortunately I have a really hard time with XML, so I hope I'll get some help with this ...

Say, I have this XML file, test.xml:

<?xml version="1.0" encoding="UTF-8"?>
<objects>
  <g id="layer3" inkscape:label="hello">
    <circle id="circ2" inkscape:label="there"/>
    <rect id="rect2" inkscape:label="world"/>
  </g>
  <g id="layer4">
    <circle id="circ3" inkscape:label="more"/>
  </g>
</objects>

So what I want to do is: for each node where inkscape:label attribute exists, copy value of the inkscape:label attribute to the id attribute; so the expected output from the above would be:

<?xml version="1.0" encoding="UTF-8"?>
<objects>
  <g id="hello" inkscape:label="hello">
    <circle id="there" inkscape:label="there"/>
    <rect id="world" inkscape:label="world"/>
  </g>
  <g id="layer4">
    <circle id="more" inkscape:label="more"/>
  </g>
</objects>

How can I do this with xmlstarlet?


Apparently I can replace all id attributes with a fixed value by using expression string("TEST") like this:

$ xmlstarlet edit -N inkscape="http://www.inkscape.org/namespaces/inkscape" --update '//*/@id' --expr 'string("TEST")' test.xml
test.xml:3.40: Namespace prefix inkscape for label on g is not defined
  <g id="layer3" inkscape:label="hello">
                                       ^
test.xml:4.46: Namespace prefix inkscape for label on circle is not defined
    <circle id="circ2" inkscape:label="there"/>
                                             ^
test.xml:5.44: Namespace prefix inkscape for label on rect is not defined
    <rect id="rect2" inkscape:label="world"/>
                                           ^
test.xml:8.45: Namespace prefix inkscape for label on circle is not defined
    <circle id="circ3" inkscape:label="more"/>
                                            ^
<?xml version="1.0" encoding="UTF-8"?>
<objects>
  <g id="TEST" inkscape:label="hello">
    <circle id="TEST" inkscape:label="there"/>
    <rect id="TEST" inkscape:label="world"/>
  </g>
  <g id="TEST">
    <circle id="TEST" inkscape:label="more"/>
  </g>
</objects>

... and I can "reinsert" the value of the attribute id with expression string(../@id) like this (so I basically get same output as input):

$ xmlstarlet edit -N inkscape="http://www.inkscape.org/namespaces/inkscape" --update '//*/@id' --expr 'string(../@id)' test.xml
test.xml:3.40: Namespace prefix inkscape for label on g is not defined
  <g id="layer3" inkscape:label="hello">
                                       ^
test.xml:4.46: Namespace prefix inkscape for label on circle is not defined
    <circle id="circ2" inkscape:label="there"/>
                                             ^
test.xml:5.44: Namespace prefix inkscape for label on rect is not defined
    <rect id="rect2" inkscape:label="world"/>
                                           ^
test.xml:8.45: Namespace prefix inkscape for label on circle is not defined
    <circle id="circ3" inkscape:label="more"/>
                                            ^
<?xml version="1.0" encoding="UTF-8"?>
<objects>
  <g id="layer3" inkscape:label="hello">
    <circle id="circ2" inkscape:label="there"/>
    <rect id="rect2" inkscape:label="world"/>
  </g>
  <g id="layer4">
    <circle id="circ3" inkscape:label="more"/>
  </g>
</objects>

... but I cannot use the same trick to read from the attribute inkscape:label, neither with the expression string(../@inkscape:label) nor with string(../@*[local-name()='label']) (as per "How does local-name find attributes in an xml node?"). I cannot really tell whether that is because of the "Namespace prefix" .. "not defined" message:

$ xmlstarlet edit -N inkscape="http://www.inkscape.org/namespaces/inkscape" --update '//*/@id' --expr 'string(../@inkscape:label)' test.xml
test.xml:3.40: Namespace prefix inkscape for label on g is not defined
  <g id="layer3" inkscape:label="hello">
                                       ^
test.xml:4.46: Namespace prefix inkscape for label on circle is not defined
    <circle id="circ2" inkscape:label="there"/>
                                             ^
test.xml:5.44: Namespace prefix inkscape for label on rect is not defined
    <rect id="rect2" inkscape:label="world"/>
                                           ^
test.xml:8.45: Namespace prefix inkscape for label on circle is not defined
    <circle id="circ3" inkscape:label="more"/>
                                            ^
<?xml version="1.0" encoding="UTF-8"?>
<objects>
  <g id="" inkscape:label="hello">
    <circle id="" inkscape:label="there"/>
    <rect id="" inkscape:label="world"/>
  </g>
  <g id="">
    <circle id="" inkscape:label="more"/>
  </g>
</objects>

And, following "get attribute value using xmlstarlet or xmllint", I can confirm that I can target the id attribute with:

xmlstarlet select -N inkscape="http://www.inkscape.org/namespaces/inkscape" --template --value-of '//*/@id' test.xml

... but the corresponding command for the inkscape:label returns nothing:

xmlstarlet select -N inkscape="http://www.inkscape.org/namespaces/inkscape" --template --value-of '//*/@inkscape:label' test.xml

It's probably that namespace thing, but I don't understand how I can ignore the namespace and just refer to the attribute names as they appear in the document ...


EDIT: finally solved the issue here with Python 3:

#!/usr/bin/env python3

# https://stackoverflow.com/questions/30097949/elementtree-findall-to-recursively-select-all-child-elements
# https://stackoverflow.com/questions/13372604/python-elementtree-parsing-unbound-prefix-error
# https://stackoverflow.com/questions/2352840/parsing-broken-xml-with-lxml-etree-iterparse
# https://stackoverflow.com/questions/28813876/how-do-i-get-pythons-elementtree-to-pretty-print-to-an-xml-file

import sys
import lxml
import lxml.etree
import xml.etree.ElementTree as ET

def proc_node(node):
  target_label = 'inkscape:label' # file without namespace, like `test.xml` here
  #target_label = '{http://www.inkscape.org/namespaces/inkscape}label' # file with namespace (like proper Inkscape .svg)
  if target_label in node.attrib:
    node.attrib['id'] = node.attrib[target_label]
  for childel in node:  # iterate over direct children; getchildren() is deprecated
    proc_node(childel)


parser1 = lxml.etree.XMLParser(encoding="utf-8", recover=True)
tree1 = ET.parse('test.xml', parser1)
ET.indent(tree1, space="  ", level=0)
proc_node(tree1.getroot())
print(lxml.etree.tostring(tree1.getroot(), xml_declaration=True, pretty_print=True, encoding='UTF-8').decode('utf-8'))

... if I call this xmlproc.py, then the result is:

$ python3 xmlproc.py
<?xml version='1.0' encoding='UTF-8'?>
<objects>
  <g id="hello" inkscape:label="hello">
    <circle id="there" inkscape:label="there"/>
    <rect id="world" inkscape:label="world"/>
  </g>
  <g id="layer4">
    <circle id="more" inkscape:label="more"/>
  </g>
</objects>

... which is exactly what I wanted.

So to specify in the spirit of how the question is postulated - how do I achieve this with xmlstarlet?

sdbbs
  • Are you sure `test.xml` really looks like your sample xml in the question? That sample doesn't have a namespace declaration for `inkscape` and ET would return an "unbound prefix" error. – Jack Fleeting Oct 13 '22 at 18:39
  • Thanks @JackFleeting - indeed, my actual file is an `inkscape` one, but since I forgot everything about XML (and namespaces) that I might have known in the past, I got surprised to see that doing a "minimal example" would not work in general, due to XML namespace prefixes. So the Python code shows how to handle that in either case - and I am wondering if, with special switches, `xmlstarlet` can be made to do the same (i.e. process both a minimal XML file with no namespace info apart from attribute prefix, and a "real" "properly namespaced" XML file) – sdbbs Oct 13 '22 at 19:09

2 Answers

in the spirit of how the question is postulated - how do I achieve this with xmlstarlet?

The input file doesn't define the inkscape namespace, which causes the XML parser (libxml2) to issue messages and parse the inkscape:label nodes as attributes belonging to the null namespace. Recall that a : (colon) in component names is tolerated but not recommended, and that the default namespace doesn't apply to attribute names.
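
For comparison with the Python route from the question, here is a minimal sketch of how a stricter parser treats the same two inputs. It assumes Python's standard-library ElementTree (expat), which rejects an undeclared prefix outright instead of warning as libxml2 does; the sample strings and the helper name attribute_names are made up for illustration.

```python
import xml.etree.ElementTree as ET

INKSCAPE_NS = "http://www.inkscape.org/namespaces/inkscape"

# Same shape as test.xml: the inkscape prefix is used but never declared.
UNDECLARED = '<objects><g id="layer3" inkscape:label="hello"/></objects>'

# Same content with the namespace declared on the root element.
DECLARED = (
    f'<objects xmlns:inkscape="{INKSCAPE_NS}">'
    '<g id="layer3" inkscape:label="hello"/></objects>'
)


def attribute_names(xml_text):
    """Return the attribute names of the first <g>, or the parser error text."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return f"ParseError: {err}"
    return list(root.find("g").attrib)


print(attribute_names(UNDECLARED))  # a ParseError message mentioning "unbound prefix"
print(attribute_names(DECLARED))    # the label key carries the namespace URI in braces
```

With the declaration in place, the attribute is addressed as {namespace-uri}label; without it, libxml2 in recovery mode keeps the literal name inkscape:label in the null namespace, which is what the local-name() workaround relies on.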

To produce the desired output you can say,

xmlstarlet -q edit \
  -u '//*[@*[local-name()="inkscape:label"][namespace-uri()=""]]/@id' \
  -x 'string(../@*[local-name()="inkscape:label"][namespace-uri()=""])' \
file.xml

where

  • the global -q option suppresses messages from the parser about the missing namespace definition
  • local-name() and namespace-uri() are used as a workaround in the special case where an unprefixed name contains a : (colon), because @inkscape:label would cause the parser to look for the non-existing inkscape namespace

Since inkscape is probably a namespace prefix in most contexts, here are two alternative methods. You can have xmlstarlet add the missing namespace node by modifying the input, using either edit,

xmlstarlet -q edit \
  -s '*' -t attr -n 'xmlns:inkscape' -v 'http://www.inkscape.org/namespaces/inkscape' \
file.xml |
xmlstarlet edit -u '//*[@inkscape:label]/@id' -x 'string(../@inkscape:label)'

or a pyx … | sed … | depyx - pipeline,

xmlstarlet -q pyx file.xml |
sed '1a\
Axmlns:inkscape http://www.inkscape.org/namespaces/inkscape' |
xmlstarlet depyx - |
xmlstarlet edit -u '//*[@inkscape:label]/@id' -x 'string(../@inkscape:label)'

The root element of the output XML will contain the inkscape namespace node. To produce the desired output (less the XML declaration) the namespace node can be deleted by appending | xmlstarlet pyx - | sed '/^Axmlns:inkscape /d' | xmlstarlet depyx - to the previous command. (xmlstarlet edit cannot delete namespace nodes.)
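
The same "declare the missing namespace first, then use ordinary namespace-aware lookups" idea can be sketched in Python with only the standard library. This is an illustration under assumptions, not a replacement for the xmlstarlet pipelines: the helper names are hypothetical, and the declaration is injected with a simple text substitution on the first start tag, which is only meant for small, regular files like test.xml.

```python
import re
import xml.etree.ElementTree as ET

INKSCAPE_NS = "http://www.inkscape.org/namespaces/inkscape"

SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<objects>
  <g id="layer3" inkscape:label="hello">
    <circle id="circ2" inkscape:label="there"/>
    <rect id="rect2" inkscape:label="world"/>
  </g>
  <g id="layer4">
    <circle id="circ3" inkscape:label="more"/>
  </g>
</objects>
"""


def declare_inkscape_ns(xml_text):
    """Add xmlns:inkscape to the first start tag unless it is already declared."""
    if "xmlns:inkscape=" in xml_text:
        return xml_text
    # "<" followed by a name character skips "<?xml ...?>" and "<!-- ... -->".
    return re.sub(
        r"<([A-Za-z_][^\s/>]*)",
        rf'<\1 xmlns:inkscape="{INKSCAPE_NS}"',
        xml_text,
        count=1,
    )


def copy_label_to_id(xml_text):
    """Copy inkscape:label to id on every element that has a label."""
    ET.register_namespace("inkscape", INKSCAPE_NS)  # keep the prefix on output
    root = ET.fromstring(declare_inkscape_ns(xml_text))
    label_key = f"{{{INKSCAPE_NS}}}label"
    for element in root.iter():
        if label_key in element.attrib:
            element.set("id", element.attrib[label_key])
    return ET.tostring(root, encoding="unicode")


print(copy_label_to_id(SAMPLE))
```

As with the xmlstarlet variants, the root element of the result carries the added xmlns:inkscape declaration.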

Converting XML to PYX notation during processing is occasionally useful for simple queries or editing of non-complex data, but XPath access is out. Beware that xmlstarlet's pyx and depyx commands do not guarantee an accurate roundtrip. depyx, for example, outputs non-collapsed empty elements (such as <void></void>) so an extra pass through xmlstarlet format is sometimes in order.
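
To make the line-oriented nature of PYX concrete, here is a rough Python sketch that prints a simplified PYX rendering of a tiny document and then adds an attribute line right after the first line, which is the effect of the sed '1a\…' step. The converter to_pyx is a hypothetical illustration; it handles only elements, attributes and text, and is not claimed to match xmlstarlet's pyx output byte for byte.

```python
import xml.etree.ElementTree as ET

SAMPLE = '<objects><g id="layer3"><circle id="circ2"/></g></objects>'


def _escape(text):
    """Escape backslashes and newlines so each PYX record stays on one line."""
    return text.replace("\\", "\\\\").replace("\n", "\\n")


def to_pyx(element):
    """Yield simplified PYX lines: '(' start, 'A' attribute, '-' text, ')' end."""
    yield f"({element.tag}"
    for name, value in element.attrib.items():
        yield f"A{name} {_escape(value)}"
    if element.text:
        yield "-" + _escape(element.text)
    for child in element:
        yield from to_pyx(child)
        if child.tail:
            yield "-" + _escape(child.tail)
    yield f"){element.tag}"


def insert_after_first_line(lines, new_line):
    """Mimic sed '1a\\...': the new attribute line attaches to the root element."""
    return lines[:1] + [new_line] + lines[1:]


pyx_lines = list(to_pyx(ET.fromstring(SAMPLE)))
patched = insert_after_first_line(
    pyx_lines, "Axmlns:inkscape http://www.inkscape.org/namespaces/inkscape"
)
print("\n".join(patched))
```

Because attribute lines directly follow the start line of the element they belong to, appending a line after line 1 is enough to put the declaration on the root element.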

urznow

It can be done with xmllint in 3 steps:

  1. Get label values
  2. Build XPath to set @id values
  3. Execute the XPath expressions
    # Step 1 - get label values into an array for elements containing both attributes
    labels=( $(printf '%s\n' 'setrootns' 'cat //*[@inkscape:label and @id]/@inkscape:label' | xmllint --shell tmp.xml | sed -rne '/inkscape:label/ s/inkscape:label="(.*)"/\1/p' ) )

    # Step 2 - build xpath
    xpath=( 'setrootns' )
    for i in "${!labels[@]}"; do
        # get current element name ;-)
        xpath[${#xpath[@]}]="xpath name((//*[@inkscape:label and @id])[$i+1])"
        xpath[${#xpath[@]}]="cd (//*[@inkscape:label and @id])[$i+1]/@id"
        xpath[${#xpath[@]}]="set ${labels[$i]}"
    done
    xpath[${#xpath[@]}]='save'
    xpath[${#xpath[@]}]='bye'

    # Step 3 - execute XPath
    printf "%s\n" "${xpath[@]}" | xmllint --shell tmp.xml

The most significant XPath expression is the one that finds the elements having both attributes. XPath positions are 1-based while bash array indexes are 0-based, so the position in the node set is the labels array index + 1:

(//*[@inkscape:label and @id])[$i+1]
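
The index arithmetic can also be sketched in Python. The function below is a hypothetical helper that only builds the list of xmllint --shell commands for a given list of labels (it does not run xmllint), and it leaves out the diagnostic xpath name(...) line from the script above.

```python
def build_xmllint_commands(labels):
    """Return xmllint --shell commands that set @id to each label, in document order."""
    target = "(//*[@inkscape:label and @id])"
    commands = ["setrootns"]
    for index, label in enumerate(labels):
        position = index + 1  # XPath positions start at 1, list indexes at 0
        commands.append(f"cd {target}[{position}]/@id")
        commands.append(f"set {label}")
    commands += ["save", "bye"]
    return commands


for line in build_xmllint_commands(["hello", "there", "world", "more"]):
    print(line)
```

The positions stay stable while the commands run, because setting @id does not change which elements have both attributes.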

Script output

/ > setrootns
/ > xpath name((//*[@inkscape:label and @id])[0+1])
Object is a string : g
/ > cd (//*[@inkscape:label and @id])[0+1]/@id
id > set hello
id > xpath name((//*[@inkscape:label and @id])[1+1])
Object is a string : circle
id > cd (//*[@inkscape:label and @id])[1+1]/@id
id > set there
id > xpath name((//*[@inkscape:label and @id])[2+1])
Object is a string : rect
id > cd (//*[@inkscape:label and @id])[2+1]/@id
id > set world
id > xpath name((//*[@inkscape:label and @id])[3+1])
Object is a string : circle
id > cd (//*[@inkscape:label and @id])[3+1]/@id
id > set more
id > save
id > bye
LMC