
I have a large set of XML files and I want to change their format a bit. How can I do that?

Here is my problem. For example, I have the following:

<annotation>
<folder>New1</folder>
<filename>0000065.jpg</filename>
<path>C:\Users\farshad\Desktop\New1\0000065.jpg</path>
<source>
    <database>Unknown</database>
</source>
<size>
    <width>710</width>
    <height>287</height>
    <depth>3</depth>
</size>
<segmented>0</segmented>
<object>
    <name>car</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox>
        <xmin>132</xmin>
        <ymin>47</ymin>
        <xmax>574</xmax>
        <ymax>283</ymax>
    </bndbox>
</object>
</annotation>

and I want to change it to the following format:

<annotation>
<folder>New1</folder>
<filename>0000065.jpg</filename>
<source>
    <database>OXFORD-IIIT Pet Dataset</database>
    <annotation>OXIIIT</annotation>
    <image>flickr</image>
</source>
<size>
    <width>710</width>
    <height>287</height>
    <depth>3</depth>
</size>
<segmented>0</segmented>
<object>
    <name>car</name>
    <pose>Unspecified</pose>
    <truncated>0</truncated>
    <occluded>0</occluded>
    <bndbox>
        <xmin>132</xmin>
        <ymin>47</ymin>
        <xmax>574</xmax>
        <ymax>283</ymax>
    </bndbox>
    <difficult>0</difficult>
</object>
</annotation>

Thanks a lot for any recommendations.

Farshad
  • Which language do you use? – Azhy Jul 14 '18 at 10:58
  • Too many options depending on your available skills and tooling (eg. XSLT, any common language has XML parsers/writers). We can help you with details of using a tool, but [SO] doesn't do tool recommendations. – Richard Jul 14 '18 at 10:59
  • You should write a program in whichever language you know and try it yourself; once you have written some code, we can help you if you run into errors. – Azhy Jul 14 '18 at 11:01
  • I use Python. Are there such tools in Python? – Farshad Jul 14 '18 at 11:02
  • Yeah, it's best to just try it yourself, and we can help you when you get errors – Azhy Jul 14 '18 at 11:03
  • Or I can show an example to get you started – Azhy Jul 14 '18 at 11:03
  • Could you please show me an example, dear Azhy? – Farshad Jul 14 '18 at 11:06
  • Yes, I have an easy idea for doing that, but one question: do all of the files have the same format as the one you showed, or are they different? – Azhy Jul 14 '18 at 11:11
  • I have a set of 2000 XML files corresponding to 2000 image files (jpg), and each of these XML files has its own xmin, ymin, xmax and ymax as the coordinates of the bounding box. – Farshad Jul 14 '18 at 11:18
  • @Farshad Do you just want to add some other elements, or do you want to delete the spaces before the tags? I don't understand. – Azhy Jul 14 '18 at 11:41
  • For example, I want to delete <path>C:\Users\farshad\Desktop\New1\0000065.jpg</path> from the first XML and then add <annotation>OXIIIT</annotation> <image>flickr</image> after the <database> tag. I also want to move <difficult>0</difficult> in the first XML to the end of the file, just before </object>. That gives me the second XML. – Farshad Jul 14 '18 at 11:43
  • OK, just wait, I am doing it using regex, although there are other ways to do it; it may take some time – Azhy Jul 14 '18 at 11:55
  • Thanks a lot dear Azhy. – Farshad Jul 14 '18 at 12:00
  • I don't think it's quite fair to mark this as a duplicate of a question that specifically asks for a Python solution, when this question is open to other approaches. The usual way of tackling this kind of transformation is to use XSLT. – Michael Kay Jul 14 '18 at 14:12

2 Answers


The usual approach to this kind of transformation is to use XSLT. I'm not going to write the code for you, and I wouldn't suggest using XSLT without first reading up on the basic concepts of the language, but in outline:

Define a rule for processing the annotation element, which processes all its children using the relevant rules:

<xsl:template match="annotation">
  <xsl:copy>
    <xsl:apply-templates/>
  </xsl:copy>
</xsl:template>

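Define a default rule for processing the children of annotation, which is to copy them unchanged. The pattern is written as * rather than annotation/* because a multi-step pattern such as annotation/* has a default priority of 0.5 and would take precedence over the path and source rules below, whose default priority is 0:

<xsl:template match="*">
  <xsl:copy-of select="."/>
</xsl:template>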

Define a rule for deleting the <path> element:

<xsl:template match="path"/>

Define a rule for transforming the <source> element. I don't know what your logic is for this bit so I'll leave it unfinished:

<xsl:template match="source">
   ...
</xsl:template>

There's a wide choice of XSLT processors available. Many of them (including the default processor for Python) only support XSLT 1.0, which is quite adequate for a simple transformation like this. Later you'll come across more complex transformations that need XSLT 2.0 or 3.0, so you may want to start with a processor that has that capability.
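
For readers who would rather stay in plain Python, the same three rules (copy everything, drop <path>, rewrite <source>) can be approximated with the standard library's xml.etree.ElementTree. This is only a sketch and not XSLT: the function name transform is made up for illustration, the sample is a shortened copy of the question's input, and the replacement values for <source> are simply copied from the desired output in the question.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's input (raw string because of the backslashes).
SAMPLE = r"""<annotation>
<folder>New1</folder>
<filename>0000065.jpg</filename>
<path>C:\Users\farshad\Desktop\New1\0000065.jpg</path>
<source>
    <database>Unknown</database>
</source>
</annotation>"""


def transform(xml_text):
    """Hypothetical helper: drop <path> and rewrite <source>."""
    root = ET.fromstring(xml_text)

    # Rule "delete <path>": remove every <path> child of <annotation>.
    for path in root.findall("path"):
        root.remove(path)

    # Rule "transform <source>": values copied from the question's target.
    source = root.find("source")
    if source is not None:
        tail = source.tail
        source.clear()      # clear() drops children, text, attributes and tail
        source.tail = tail  # keep the whitespace that followed </source>
        for tag, text in (("database", "OXFORD-IIIT Pet Dataset"),
                          ("annotation", "OXIIIT"),
                          ("image", "flickr")):
            ET.SubElement(source, tag).text = text

    # Every other element is left untouched, like the default copy rule.
    return ET.tostring(root, encoding="unicode")


result = transform(SAMPLE)
print(result)
```

The new children of <source> are written without indentation; that does not affect how the XML is parsed, only how it looks.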

Michael Kay

I finally found something, and I am sorry for the delay. I read that regular expressions should not be used to process markup languages such as XML or HTML, and the advice against combining the two is strong, so I decided to do it with a DOM / XML parser package instead. So let's get started:

I wrote some code for you. You need to make a few changes to it before using it, and I strongly suggest trying it on a few example files first. I am not saying my code is invalid, but you said you have a large number of files and I don't want you to ruin all of them, so test it first to learn how to use it.

Some notes:

1 - tagIndex is the index of the element among all elements with the same tag name. Sometimes there are two elements with the same name, so use it when that happens; it comes from *.getElementsByTagName(...)[tagIndex].

2 - Test it on a few examples first to learn how to use it. You can skip this, but I don't want you to lose all of your data because of some small error. Don't worry, I am not saying my code has errors, and you can read it yourself; the warning is only about losing your data.

3 - Don't forget to set the containing folder.

4 - I wanted to add a feature for adding elements before or after specified elements, but I didn't because I thought there was no need for it. I did create a class (addingPlace) for it in case you want to implement it yourself.

5 - Write your own processing code at the marked position in the final for loop.
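
Note 4 mentions placing an element before or after a specific element. With minidom this could be done with insertBefore; the following is a minimal sketch under assumptions, not part of the code below. The helper name insert_after and the shortened <object> sample are made up for illustration.

```python
import xml.dom.minidom as dom

# Shortened <object> block, loosely based on the question's sample.
SAMPLE = """<object>
    <name>car</name>
    <truncated>0</truncated>
    <difficult>0</difficult>
    <bndbox><xmin>132</xmin></bndbox>
</object>"""


def insert_after(new_node, ref_node):
    """Hypothetical helper: place new_node right after ref_node."""
    # insertBefore with a reference of None appends at the end, so this
    # also works when ref_node is the last child of its parent.
    ref_node.parentNode.insertBefore(new_node, ref_node.nextSibling)


doc = dom.parseString(SAMPLE)

# Add <occluded>0</occluded> right after <truncated>.
occluded = doc.createElement("occluded")
occluded.appendChild(doc.createTextNode("0"))
insert_after(occluded, doc.getElementsByTagName("truncated")[0])

# Move <difficult> so that it follows <bndbox>; inserting a node that is
# already in the tree detaches it from its old position first.
difficult = doc.getElementsByTagName("difficult")[0]
insert_after(difficult, doc.getElementsByTagName("bndbox")[0])

# Collect the element children in document order to check the result.
element_order = [n.tagName for n in doc.documentElement.childNodes
                 if n.nodeType == n.ELEMENT_NODE]
print(element_order)
```

Whitespace text nodes are left where they were, so the indentation of the output may look slightly uneven even though the element order is the one wanted.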

Code

import os, xml.dom.minidom as dom
from enum import Enum

#-----------------------definePath
containingFolder ="pathToContainingFolder"

files = os.listdir(containingFolder)

#if you want to add before or after specific elements,
#then add this feature to the adding method
class addingPlace():

    class types(Enum):
        Parent = 0
        Above  = 1
        Below  = 2

    def __init__(self, TagName, PlaceType):
        self.TagName = TagName
        self.PlaceType = PlaceType

    #return the tagIndex-th element named self.TagName
    def getElement(self, parser, tagIndex=0):
        return parser.getElementsByTagName(self.TagName)[tagIndex]


#---------------------delete element
def deleteElement(selfTag, parser, tagIndex=0):
    #look up the tagIndex-th element named selfTag in the parsed document
    try:
        s = parser.getElementsByTagName(selfTag)[tagIndex]
    except IndexError:
        print("Error in deleteElement (tag name or tag index is invalid)")
        return
    #an element found through the document always has a parent node
    s.parentNode.removeChild(s)


#---------------------add element
def addElement(tagName, parentTagName, parser, elementText=None, parentTagIndex=0):
    #create the node through the document so that ownerDocument is set
    element = parser.createElement(tagName)

    if elementText is not None:
        element.appendChild(parser.createTextNode(elementText))

    try:
        parentElement = parser.getElementsByTagName(parentTagName)[parentTagIndex]
    except IndexError:
        print("Error in addElement (parent tag name or tag index is invalid)")
        return
    #appendChild (unlike childNodes.append) also sets parentNode and siblings
    parentElement.appendChild(element)


#-------------------transfer element to specified parent
def transferElement(tagName, parentTagName, parser, tagIndex=0, parentTagIndex=0):
    try:
        element = parser.getElementsByTagName(tagName)[tagIndex]
    except IndexError:
        print("Error in transferElement (tag name or tag index is invalid)")
        return
    #find the new parent before touching the tree, so nothing is lost on error
    try:
        parentElement = parser.getElementsByTagName(parentTagName)[parentTagIndex]
    except IndexError:
        print("Error in transferElement (parent tag name or tag index is invalid)")
        return
    #appendChild detaches the node from its old parent before re-attaching it
    parentElement.appendChild(element)



#----------------------usage location

for f in files:
    if not f.lower().endswith(".xml"):
        continue    #skip anything in the folder that is not an xml file
    with open(os.path.join(containingFolder, f), 'r+') as fl:
        fileText = fl.read()
        xmlParsed = dom.parseString(fileText)     #use this as parser
        root = xmlParsed.documentElement.nodeName #use this as root element

        #here you can use the adding, deleting and transferring methods
        # this is an example:-
        #addElement("C_Type",root,xmlParsed,elementText="ASCI")


        formattedText = xmlParsed.toxml()
        fl.seek(0)                #overwrite the file from the start
        fl.write(formattedText)
        fl.truncate()             #drop any leftover bytes of the old content
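
Because the loop above overwrites each file in place, one cautious alternative is to write the converted files into a separate folder and keep the originals. The following is a rough sketch of that idea under assumptions: convert_folder and drop_path are hypothetical names, and the temporary directory with a single tiny file exists only to make the example runnable.

```python
import os
import tempfile
import xml.dom.minidom as dom


def convert_folder(src_dir, dst_dir, edit):
    """Hypothetical helper: parse each .xml file in src_dir, apply edit(doc),
    and write the result to dst_dir, leaving the originals untouched."""
    os.makedirs(dst_dir, exist_ok=True)
    written = []
    for name in sorted(os.listdir(src_dir)):
        if not name.lower().endswith(".xml"):
            continue  # ignore anything that is not an XML file
        doc = dom.parse(os.path.join(src_dir, name))
        edit(doc)
        with open(os.path.join(dst_dir, name), "w", encoding="utf-8") as out:
            out.write(doc.toxml())
        written.append(name)
    return written


def drop_path(doc):
    """Example edit: remove every <path> element from the document."""
    for node in doc.getElementsByTagName("path"):
        node.parentNode.removeChild(node)


# Demonstration on a throwaway folder containing one small file.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "in")
    dst = os.path.join(tmp, "out")
    os.makedirs(src)
    with open(os.path.join(src, "a.xml"), "w", encoding="utf-8") as f:
        f.write("<annotation><path>x.jpg</path><folder>New1</folder></annotation>")

    written = convert_folder(src, dst, drop_path)

    with open(os.path.join(dst, "a.xml"), encoding="utf-8") as f:
        converted = f.read()
    with open(os.path.join(src, "a.xml"), encoding="utf-8") as f:
        original = f.read()

print(written)
print(converted)
```

Once the output folder looks right on a handful of files, the same call could be pointed at the real folder of annotations.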
Azhy