This is a matter of efficiency rather than troubleshooting. I have the following code snippet:
# The -R flag restores malformed XML
xmlstarlet -q fo -R <<<"$xml_content" | \
# Delete xml_data
xmlstarlet ed -d "$xml_data" | \
# Delete index
xmlstarlet ed -d "$xml_index" | \
# Delete specific objects
xmlstarlet ed -d "$xml_nodes/objects" | \
# Append new node
xmlstarlet ed -s "$xml_nodes" -t elem -n subnode -v "Hello World" | \
# Add x attribute to node
xmlstarlet ed -i "($xml_nodes)[last()]" -t attr -n x -v "0" | \
# Add y attribute to node
xmlstarlet ed -i "($xml_nodes)[last()]" -t attr -n y -v "0" | \
# Add z attribute to node
xmlstarlet ed -i "($xml_nodes)[last()]" -t attr -n z -v "1" \
> "$output_file"
- The variable $xml_content contains the XML tree of contents and nodes, read with the cat command from a file of size 472.6 MB.
- The variable $output_file, as its name indicates, contains the path to the output file.
- The rest of the variables simply contain the corresponding XPaths I want to edit.
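For comparison, here is a minimal sketch of handing the file to xmlstarlet directly instead of going through cat and a shell variable. The function name recover_xml and its path argument are hypothetical and not part of my script. It assumes fo reads a file argument the same way it reads standard input. I have not measured whether this changes the run time.

```shell
# Sketch only: recover_xml and its argument are hypothetical names.
# $1 is the path of the large source file; the -R flag recovers malformed XML
# exactly as in the pipeline above, but the document never sits in a variable.
recover_xml() {
  xmlstarlet -q fo -R "$1"
}

# Possible usage (same variables as in the snippet above):
#   recover_xml "$input_file" | xmlstarlet ed -d "$xml_data" > "$output_file"
```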
This brief article, which helped me come up with this code, notes that:
This is a bit ineffeciant since the xml file is parsed and written twice.
In my case, it is parsed and written more than twice (eventually in a loop
over 1000 times).
Taking the above script, that short fragment alone takes 4 mins and 7 secs to execute.
Assuming the excessive, repetitive and perhaps inefficient piping, together with the file size, is why the code runs slowly, it will execute even more slowly the more subnodes I ultimately insert/delete.
I apologise in advance if I sound repetitive or am raising an old and probably already answered topic; however, I'm really keen to understand how xmlstarlet works in detail with large XML documents.
UPDATE
As claimed by @Cyrus in his prior answer:
Those two xmlstarlets should do the job:
xmlstarlet -q fo -R <<<"$xml_content" |\
xmlstarlet ed \
-d "$xml_data" \
-d "$xml_index" \
-d "$xml_nodes/objects" \
-s "$xml_nodes" -t elem -n subnode -v "Hello World" \
-i "($xml_nodes)[last()]" -t attr -n x -v "0" \
-i "($xml_nodes)[last()]" -t attr -n y -v "0" \
-i "($xml_nodes)[last()]" -t attr -n z -v "1" > "$output_file"
This produced the following errors:
-:691.84: Attribute x redefined
-:691.84: Attribute z redefined
-:495981.9: xmlSAX2Characters: huge text node: out of memory
-:495981.9: Extra content at the end of the document
I honestly don't know how these errors were produced, because I changed the code too often while testing various scenarios and potential alternatives. However, this is what did the trick for me:
xmlstarlet ed --omit-decl -L \
-d "$xml_data" \
-d "$xml_index" \
-d "$xml_nodes/objects" \
-s "$xml_nodes" -t elem -n subnode -v "Hello World" \
"$temp_xml_file"
xmlstarlet ed --omit-decl -L \
-i "($xml_nodes)[last()]" -t attr -n x -v "0" \
-i "($xml_nodes)[last()]" -t attr -n y -v "0" \
-i "($xml_nodes)[last()]" -t attr -n z -v "1" \
"$temp_xml_file"
Regarding the actual data that is inserted, this is what I have at the beginning:
...
<node>
<subnode>A</subnode>
<subnode>B</subnode>
<objects>1</objects>
<objects>2</objects>
<objects>3</objects>
...
</node>
...
Executing the above (split) code gives me what I want:
...
<node>
<subnode>A</subnode>
<subnode>B</subnode>
<subnode x="0" y="0" z="1">Hello World</subnode>
</node>
...
By splitting them, xmlstarlet is able to insert the attributes into the newly created node; otherwise it adds them to the last() instance of the selected XPath before the --subnode is even created. To some extent this is still inefficient; nevertheless, the code now runs in less than a minute.
The following code,
xmlstarlet ed --omit-decl -L \
-d "$xml_data" \
-d "$xml_index" \
-d "$xml_nodes/objects" \
-s "$xml_nodes" -t elem -n subnode -v "Hello World" \
-i "($xml_nodes)[last()]" -t attr -n x -v "0" \
-i "($xml_nodes)[last()]" -t attr -n y -v "0" \
-i "($xml_nodes)[last()]" -t attr -n z -v "1" \
"$temp_xml_file"
however, gives me this:
...
<node>
<subnode>A</subnode>
<subnode x="0" y="0" z="1">B</subnode>
<subnode>Hello World</subnode>
</node>
...
By joining the xmlstarlets into one, as in this post also answered by @Cyrus, it somehow first adds the attributes and then creates the --subnode whose innerText is Hello World.
- Can anyone explain why this strange behaviour is happening?
This is another reference, which states that "every edit operation is performed in sequence".
The above article explains exactly what I'm looking for, yet I cannot manage to make it all work in one xmlstarlet ed invocation. Alternatively, I tried:
- Replacing ($xml_nodes)[last()] with $xml_nodes[text() = 'Hello World']
- Using $prev (or $xstar:prev) as the argument to -i, like in this answer. [Examples]
- The temporary element name trick via -r, to rename the temp node after the attr are added
All of the above insert the --subnode but leave the new element without attributes.
Note: I run XMLStarlet 1.6.1 on OS X El Capitan v 10.11.3
BONUS
As I mentioned in the beginning, I wish to use a loop along these lines:
list="$(tr -d '\r' < "$names")"
for name in $list; do
xmlstarlet ed --omit-decl -L \
-d "$xml_data" \
-d "$xml_index" \
-d "$xml_nodes/objects" \
-s "$xml_nodes" -t elem -n subnode -v "$name" \
-i "($xml_nodes)[last()]" -t attr -n x -v "0" \
-i "($xml_nodes)[last()]" -t attr -n y -v "0" \
-i "($xml_nodes)[last()]" -t attr -n z -v "1" \
"$temp_xml_file"
done
The $list contains over a thousand different names which need to be added with their respective attributes. The --value of each attribute may vary with every loop as well. Given the above model:
- What is the fastest and most accurate version of such a loop, given that the attributes must be added correctly to the corresponding node?
- Would it be faster to create the list of nodes in an external txt file and later add those XML elements (inside the txt file) into another XML file? If yes, how? Perhaps with sed or grep?
Regarding the last question, I am referring to something like this. The node where the XML from the txt file should be added has to be specific, i.e. at least selectable by XPath, because I want to edit certain nodes only.
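To make the txt-file idea concrete, here is a minimal sketch of how the fragment file could be generated from the names file. The function name make_fragment is hypothetical. The attribute values are fixed at x="0" y="0" z="1" purely for illustration, whereas in my real data they vary per name. The sketch only produces the text. How to insert it at an XPath-selected node is still the open part of the question.

```shell
# Sketch only: make_fragment is a hypothetical helper, not an xmlstarlet feature.
# $1 is the names file, one name per line (Windows line endings are stripped).
# Prints one <subnode> element per name, with &, < and > escaped for XML text.
make_fragment() {
  tr -d '\r' < "$1" |
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' |
    while IFS= read -r name || [ -n "$name" ]; do
      # Skip empty lines so no empty <subnode/> elements are produced.
      [ -n "$name" ] || continue
      printf '<subnode x="0" y="0" z="1">%s</subnode>\n' "$name"
    done
}

# Possible usage (fragment_file is a hypothetical output path):
#   make_fragment "$names" > "$fragment_file"
```

Reading the names line by line, rather than with for name in $list, also keeps names that contain spaces in one piece.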
Note: The above model is just an example. The actual loop will add 26 --subnodes for each loop and 3 or 4 attr for each --subnode. That's why it's important for xmlstarlet to add the attr properly and not to some other element. They have to be added in order.