I have hundreds of XML files in a directory, all with exactly the same structure. I want to merge some of the nodes across the files and keep the rest as they are.

Example xml 1

<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
<name>imglab dataset</name>
<comment>Created by imglab tool.</comment>
<images>
<image file='/home/orcl/user102339/Area123/Geo_Tag_0812-0420.jpg'></image>
<image file='/home/orcl/user102339/Area123/Geo_Tag_0812-0544.jpg'>
<box top='343' left='72' width='92' height='29'>
<label>LBS_Marks
</label></box></image>
<image file='/home/orcl/user102339/Area123/Geo_Tag_0812-0489.jpg'></image>
</images>
</dataset>

Example xml 2

<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
<name>imglab dataset</name>
<comment>Created by imglab tool.</comment>
<images>
<image file="/home/orcl/user102339/Area123/Geo_Tag_0812-0420.jpg">
    <box top="505" left="326" width="59" height="32">
                <label>SBS_Marks</label>
            </box>
    </image>
<image file="/home/orcl/user102339/Area123/Geo_Tag_0812-0544.jpg">
    <box top="507" left="331" width="50" height="27">
                <label>SBS_Marks</label>
            </box>
    </image>
<image file="/home/orcl/user102339/Area123/Geo_Tag_0812-0489.jpg">
    <box top="509" left="330" width="51" height="25">
                <label>SBS_Marks</label>
            </box>
    </image>
</images>
</dataset>

In both data sets the images are the same, but the markings differ. For example, in the first file the first image, 0420.jpg, has no box elements, while the same image in the second file has a box with the label SBS_Marks. I am trying to merge these files so that each image collects the box coordinates and labels from every file. The desired output is as follows:

<?xml-stylesheet type='text/xsl' href='image_metadata_stylesheet.xsl'?>
<dataset>
<name>imglab dataset</name>
<comment>Created by imglab tool.</comment>
<images>
<image file='/home/orcl/user102339/Area123/Geo_Tag_0812-0420.jpg'>
<box top="505" left="326" width="59" height="32">
                <label>SBS_Marks</label>
            </box>
</image>
<image file='/home/orcl/user102339/Area123/Geo_Tag_0812-0544.jpg'>
<box top='343' left='72' width='92' height='29'>
<label>LBS_Marks
</label></box>
<box top="507" left="331" width="50" height="27">
                <label>SBS_Marks</label>
            </box>
</image>
<image file='/home/orcl/user102339/Area123/Geo_Tag_0812-0489.jpg'>
<box top="509" left="330" width="51" height="25">
                <label>SBS_Marks</label>
            </box>

</image>
</images>
</dataset>

In the desired output, the first image, 0420.jpg, has the box and label elements from the second file; the second image, 0544.jpg, has two boxes and labels, one each from file 1 and file 2; and the third image has the box and label from the second file.

I tried using this code:

#!/usr/bin/env python
import sys
from xml.etree import ElementTree

def run(files):
    first = None
    for filename in files:
        data = ElementTree.parse(filename).getroot()
        if first is None:
            first = data
        else:
            first.extend(data)
    if first is not None:
        print ElementTree.tostring(first)

if __name__ == "__main__":
    run(sys.argv[1:])

But this just prints the contents of the files one after the other and does not merge them. I don't know how to write an XSL template, so I could not try that route. Can someone suggest better code for the above, or provide an XSL template that merges all the files in the folder?

popeye
  • In the merged document, shouldn't the `box` element be a child of the first `image`? Do you need to merge those files with Python and XSLT 1? Also, do all files have the same `image` elements (it seems identity is based on the `file` attribute value) in the same order? – Martin Honnen Nov 19 '17 at 13:38
  • @MartinHonnen Each image may have a box element as a child, and whenever a box element is present it has a child 'label'. The problem is that each image may have multiple box elements or none at all, depending on whether we identify the entities in the image. Yes, I need to merge the files with Python, as my other code is in Python, though I am OK with using a bash script if that is easier. All the files will have the same images; for example, the images 420, 544, 489 will be present in all the hundreds of folders. – popeye Nov 20 '17 at 03:57
  • @MartinHonnen Yes, you are right. I have corrected the typo in the desired output. Do you have any thoughts? I tried reading XSLT manuals, but I have not made much progress yet. – popeye Nov 21 '17 at 09:17

1 Answer

If you are restricted to XSLT 1, then I think one approach is to use Python to construct an XML document that lists all the XML files in your directory that you want to merge, e.g. in the format

<files>
  <file name="doc1.xml"/>
  <file name="doc2.xml"/>
  ...
</files>

then you use that file as the input document to your XSLT and write the stylesheet as

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">

    <xsl:strip-space elements="*"/>
    <xsl:output indent="yes"/>

    <xsl:variable name="files" select="document(files/file/@name)"/>

    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="/">
        <xsl:apply-templates select="$files[1]/node()"/>
    </xsl:template>

    <xsl:template match="image[@file]">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
            <xsl:apply-templates select="$files[position() > 1]//image[@file = current()/@file]/node()"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
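
The `<files>` listing that serves as the stylesheet's input could be generated along these lines. This is only a minimal sketch: the function name `build_file_list` is hypothetical, and it assumes the listing (called `listofxmls.xml` here, as in the comments below) is saved in the same directory as the files it names, so that the relative names passed to `document()` resolve against it.

```python
import glob
import os
from xml.etree import ElementTree as ET


def build_file_list(directory, output_path):
    """Write a <files> document naming every .xml file in `directory`."""
    root = ET.Element("files")
    output_abs = os.path.abspath(output_path)
    # Sort so that the "first" file, which the stylesheet uses as the
    # skeleton of the result, is predictable.
    for path in sorted(glob.glob(os.path.join(directory, "*.xml"))):
        if os.path.abspath(path) == output_abs:
            continue  # do not list the listing itself on a second run
        # Assumption: bare file names are enough because the listing
        # lives next to the files it refers to.
        ET.SubElement(root, "file", name=os.path.basename(path))
    ET.ElementTree(root).write(output_path, encoding="utf-8",
                               xml_declaration=True)
    return root
```

For example, `build_file_list("/path/to/xmls", "/path/to/xmls/listofxmls.xml")` would write the listing next to the input files (the path is a placeholder).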

Of course, if you have hundreds of files to merge, then `document(files/file/@name)` will pull them all into memory, but I don't see any way around that if you want to merge them all with a single transformation.
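
Since the question also asks for a Python route, here is a rough sketch of the same merge done with `xml.etree.ElementTree` instead of XSLT. It is not the stylesheet above translated literally: it parses one file at a time and keeps only the merged tree, and it assumes (as stated in the comments on the question) that `image` elements are matched by their `file` attribute. The function name `merge_files` is hypothetical.

```python
from xml.etree import ElementTree as ET


def merge_files(filenames):
    """Append the children of matching <image> elements to the first file's tree."""
    merged_root = None
    images_by_file = {}  # value of the 'file' attribute -> <image> in merged tree
    for filename in filenames:
        root = ET.parse(filename).getroot()
        if merged_root is None:
            # The first document is the skeleton of the result.
            merged_root = root
            for image in root.iter("image"):
                images_by_file[image.get("file")] = image
            continue
        for image in root.iter("image"):
            target = images_by_file.get(image.get("file"))
            if target is None:
                continue  # assumption: skip images missing from the first file
            # Move every child (the <box> elements) under the matching image.
            target.extend(list(image))
    return merged_root
```

The result can be serialized with `ET.tostring(merge_files(paths), encoding="unicode")` on Python 3. Note that `ElementTree` drops the `xml-stylesheet` processing instruction that precedes the root element, so it would have to be written back by hand if it is needed in the output.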

Martin Honnen
  • Very kind of you. I was taking inspiration from the book XSLT - Mastering XML Transformation by Doug Tidwell, which describes a very similar approach. I have already created an XML file listing all the files, but I couldn't get the stylesheet written. I will check this out and get back to you. Thank you. – popeye Nov 21 '17 at 10:49
  • My apologies to bother you. I created the ref xml with all file names and saved it as listofxmls.xml. I changed the 5th line of xsl starting `.` I ran the code as given in this post `https://stackoverflow.com/questions/32651932/merging-lots-of-xml-files`...but gives me constant errors. – popeye Nov 22 '17 at 10:30
  • `` does not make sense to me, if you want to load that file as a secondary input then you need ``, with two nested `document` calls. – Martin Honnen Nov 22 '17 at 10:37
  • Thank you so much for all the clarification and patience. The xsl file works fine. Thank you again. – popeye Nov 22 '17 at 16:48