15

I have several XML files. They all have the same structure, but were split due to file size. So, let's say I have A.xml, B.xml, C.xml and D.xml and want to combine/merge them into combined.xml, using a command line tool.

A.xml

<products>
    <product id="1234"></product>
    ...
</products>

B.xml

<products>
  <product id="5678"></product>
  ...
</products>

etc.

Zombo
TutanRamon

5 Answers

23

High-tech answer:

Save this Python script as xmlcombine.py:

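#!/usr/bin/env python
import sys
from xml.etree import ElementTree

def run(files):
    # The root of the first file becomes the root of the combined document.
    first = None
    for filename in files:
        data = ElementTree.parse(filename).getroot()
        if first is None:
            first = data
        else:
            # Add every child of this file's root to the first root.
            first.extend(data)
    if first is not None:
        xml = ElementTree.tostring(first)
        # On Python 3, tostring() returns bytes; decode so that print()
        # emits plain XML text instead of a b'...' literal.
        if isinstance(xml, bytes):
            xml = xml.decode()
        print(xml)

if __name__ == "__main__":
    run(sys.argv[1:])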

To combine files, run:

python xmlcombine.py ?.xml > combined.xml

For further enhancement, consider using:

  • chmod +x xmlcombine.py: Lets you run the script directly (as ./xmlcombine.py, or as xmlcombine.py if it is on your PATH) without typing python

  • xmlcombine.py !(combined).xml > combined.xml: Collects all XML files except the output, but requires bash's extglob option

  • xmlcombine.py *.xml | sponge combined.xml: Also includes the existing contents of combined.xml in the result, but requires the sponge program

  • import lxml.etree as ElementTree: Uses a potentially faster XML parser
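If bash's extglob option is not available, the output file could instead be filtered out inside Python. The following is only a sketch under assumptions: the helper name input_files and its parameters are hypothetical and not part of the script above.

```python
import glob
import os

def input_files(pattern, output_name):
    """Return the files matching pattern, sorted, excluding the output file.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    out = os.path.abspath(output_name)
    return sorted(f for f in glob.glob(pattern) if os.path.abspath(f) != out)
```

The resulting list could then be passed to run() in place of sys.argv[1:]. Comparing absolute paths means the output file is excluded even though the shell has already created it by the time the script starts.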

Martin Delille
eswald
  • In MacOS Python 2.6 ... I'm getting `_ElementInterface instance has no attribute 'extend'`. Which version of python are you using? – metasim Dec 11 '12 at 13:47
  • 1
    This was probably Python 2.7 on Linux. You could work around that by iterating over the children and adding each one manually; or perhaps lxml will work for you. – eswald Dec 14 '12 at 16:51
  • on trying i am getting ./xmlcombine.py ?.xml > combine.xml Traceback (most recent call last): File "./xmlcombine.py", line 17, in run(sys.argv[1:]) File "./xmlcombine.py", line 8, in run data = ElementTree.parse(filename).getroot() File "/usr/lib64/python2.6/xml/etree/ElementTree.py", line 862, in parse tree.parse(source, parser) File "/usr/lib64/python2.6/xml/etree/ElementTree.py", line 579, in parse source = open(source, "rb") IOError: [Errno 2] No such file or directory: '?.xml' – Vik Mar 06 '16 at 23:53
  • 1
    @Vik: The question mark is a shell globbing character, standing for any single character. If your xml files aren't named like the ones in the question, then ?.xml fails to match anything, so the shell passes it as-is to the program, which complains because there's no file with that name. Try passing it the real names of your files; just make sure that the output file doesn't get passed in as an input. – eswald Mar 07 '16 at 19:15
  • @eswald I am getting the error: "xml.etree.ElementTree.ParseError: junk after document element: line 89, column 0", what should I do? – Inês Martins May 31 '16 at 08:51
  • 1
    @InêsMartins: That means one of your source xml files isn't correctly formatted. First, ensure that your output file isn't getting listed as one of the input files by mistake; if that's not the issue, then check the format of each input file. Try running xmlcombine.py on each file individually, and see which one produces the error. – eswald Jun 02 '16 at 17:45
  • 1
    This just concatenates both input files within a single top level element. It does not merge on a lower level. So, I get something like ``. It does not merge the lower levels as in `` – gctwnl May 15 '20 at 10:08
  • For pretty printing change it like this: https://i.imgur.com/s10GEBp.png (ref: https://stackoverflow.com/a/39984422/6907424) – hafiz031 Jul 08 '21 at 01:24
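The workaround suggested earlier in these comments for Python versions whose elements lack extend (iterating over the children and adding each one manually) might look like the sketch below. The function name combine_roots is hypothetical. The optional indent call exists only in Python 3.9 and later, so it is guarded, and it is not necessarily the approach shown in the linked image.

```python
from xml.etree import ElementTree

def combine_roots(roots):
    """Merge the children of every root into the first root using append()."""
    first = None
    for root in roots:
        if first is None:
            first = root
        else:
            # list() takes a snapshot so we do not iterate a changing element.
            for child in list(root):
                first.append(child)
    return first

a = ElementTree.fromstring('<products><product id="1234"></product></products>')
b = ElementTree.fromstring('<products><product id="5678"></product></products>')
merged = combine_roots([a, b])

# Optional pretty printing; ElementTree.indent was added in Python 3.9.
if hasattr(ElementTree, "indent"):
    ElementTree.indent(merged)
```

As one comment points out, this only gathers the children under a single top-level element; it does not merge anything at lower levels.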
10

xml_grep

http://search.cpan.org/dist/XML-Twig/tools/xml_grep/xml_grep

xml_grep --pretty_print indented --wrap products --descr '' --cond "product" *.xml > combined.xml

  • --wrap : encloses/wraps the XML result with the given tag. (here: products)
  • --cond : the xml subtree to grep (here: product)
berk
2

Low-tech simple answer:

echo '<products>' > combined.xml
grep -vh '</\?products>\|<?xml' *.xml >> combined.xml
echo '</products>' >> combined.xml

Limitations:

  • The opening and closing tags need to be on their own line.
  • The files need to all have the same outer tags.
  • The outer tags must not have attributes.
  • The files must not have inner tags that match the outer tags.
  • Any current contents of combined.xml will be wiped out instead of getting included.

Each of these limitations can be worked around, but not all of them easily.
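Before relying on the grep approach, some of the limitations above can be checked up front. This is a rough sketch with a hypothetical function name, check_roots, and an assumed outer tag of products. It covers only three of the listed limitations (same outer tag, no attributes on the outer tag, no inner tag matching the outer tag) and cannot tell whether the tags sit on their own lines.

```python
from xml.etree import ElementTree

def check_roots(xml_texts, expected_tag="products"):
    """Return a list of (index, message) pairs for inputs that violate
    the assumptions of the line-based grep approach."""
    problems = []
    for i, text in enumerate(xml_texts):
        root = ElementTree.fromstring(text)
        if root.tag != expected_tag:
            problems.append((i, "unexpected root tag: " + root.tag))
        if root.attrib:
            problems.append((i, "root has attributes"))
        # ".//tag" searches all levels beneath the root, not the root itself.
        if root.find(".//" + expected_tag) is not None:
            problems.append((i, "inner tag matches outer tag"))
    return problems
```

An empty result means none of the three checked limitations was hit for the given inputs.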

eswald
0

Merging two trees includes the task of identifying which parts are identical and which should be replaced. Unfortunately, this is not obvious: it takes more semantics than can be inferred from the source XML documents alone.

Consider the case where the first document has a middle level with several elements that share the same tag but have different attributes. The second document adds an attribute to one of the existing elements at that middle level, and also adds another child to it. To merge these correctly, one has to know the semantics.

<params>
...
<param><name>hello</name><value>world</value></param>
...
</params>

add/merge:

<params>
   <param><name>hello</name><value>yellow submarine</value></param>
</params>
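To make the point concrete, here is a sketch of one possible semantic, chosen purely for illustration: the name element is treated as the key, and a value from the second document replaces the value in the first. Neither rule can be derived from the XML itself, and the function name merge_params is hypothetical.

```python
from xml.etree import ElementTree

def merge_params(base, update):
    """Merge <param> elements from update into base.

    Assumptions (not implied by the documents themselves): <name> is the
    key, every <param> has a <value>, and the later value wins.
    """
    by_name = {p.findtext("name"): p for p in base.findall("param")}
    for new in update.findall("param"):
        old = by_name.get(new.findtext("name"))
        if old is None:
            base.append(new)  # unknown name: add as a new parameter
        else:
            old.find("value").text = new.findtext("value")  # known name: replace
    return base

base = ElementTree.fromstring(
    "<params><param><name>hello</name><value>world</value></param></params>")
update = ElementTree.fromstring(
    "<params><param><name>hello</name><value>yellow submarine</value></param></params>")
merged = merge_params(base, update)
```

A different semantic, such as keeping both values or treating duplicate names as an error, would be equally consistent with the input documents.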
0

Another very helpful tool is yq, which aims to be jq for YAML, TOML and XML.

It can be installed via pip; the XML handling command is then called xq.

pip install yq
xq .products ?.xml --xml-output --xml-root=products > combined.xml
Gerald Schneider