I have an XML file like this:

  <Table1>
    <ID>1</ID>
    <Name>ABC1</Name>
  </Table1>
  <Table1>
    <ID>2</ID>
    <Name>ABC2</Name>
  </Table1>
  <Table2>
    <NEWID>1</NEWID>
    <phone>123</phone>
  </Table2>
  <Table2>
    <NEWID>2</NEWID>
    <phone>12334</phone>
  </Table2>
  <Table3>
    <SNO>1</SNO>
    <data>XYZ</data>
  </Table3>
  <Table3>
    <SNO>2</SNO>
    <data>SDF</data>
  </Table3>

I want a new XML file that contains only the first entry for each table, i.e. the new file should look something like:

  <Table1>
    <ID>1</ID>
    <Name>ABC1</Name>
  </Table1>
  <Table2>
    <NEWID>1</NEWID>
    <phone>123</phone>
  </Table2>
  <Table3>
    <SNO>1</SNO>
    <data>XYZ</data>
  </Table3>

The file I am actually working with has hundreds of such tables and about a million rows, so building the new XML by hand is not feasible. Can this be done in Python or in some other way?

  • This should be possible to do in Python. Note that a root element is required in XML. Have you tried anything at all? – mzjn Aug 16 '23 at 11:48
  • Please use universal measurements instead of local words like lakh. A lot of people aren't going to understand what lakhs of rows means. – James Z Aug 20 '23 at 14:33

2 Answers

You could use XSLT for this. See this answer on how to use XSLT in Python.

The XSLT you need is:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="utf-8" indent="yes"/>

  <xsl:template match="/*">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  
  <xsl:template match="*[starts-with(name(),'Table')]">
    <xsl:variable name="name"  select="name()"/>
    <xsl:if test="not(preceding-sibling::*[name()=$name])">
      <xsl:copy-of select="."/>
    </xsl:if>
  </xsl:template>
  
</xsl:stylesheet>

Given this input xml:

<?xml version="1.0" encoding="utf-8"?>
<root>
  <Table1>
    <ID>1</ID>
    <Name>ABC1</Name>
  </Table1>
  <Table1>
    <ID>2</ID>
    <Name>ABC2</Name>
  </Table1>
  <Table2>
    <NEWID>1</NEWID>
    <phone>123</phone>
  </Table2>
  <Table2>
    <NEWID>2</NEWID>
    <phone>12334</phone>
  </Table2>
  <Table3>
    <SNO>1</SNO>
    <data>XYZ</data>
  </Table3>
  <Table3>
    <SNO>2</SNO>
    <data>SDF</data>
  </Table3>
</root>

will result in this output XML:

<?xml version="1.0" encoding="utf-8"?>
<root>
   <Table1>
      <ID>1</ID>
      <Name>ABC1</Name>
   </Table1>
   <Table2>
      <NEWID>1</NEWID>
      <phone>123</phone>
   </Table2>
   <Table3>
      <SNO>1</SNO>
      <data>XYZ</data>
   </Table3>
</root>
Siebe Jongebloed

You can keep the first element for each tag name and remove the others:

import xml.etree.ElementTree as ET

xml="""<?xml version="1.0" encoding="utf-8"?>
<root>
  <Table1>
    <ID>1</ID>
    <Name>ABC1</Name>
  </Table1>
  <Table1>
    <ID>2</ID>
    <Name>ABC2</Name>
  </Table1>
  <Table2>
    <NEWID>1</NEWID>
    <phone>123</phone>
  </Table2>
  <Table2>
    <NEWID>2</NEWID>
    <phone>12334</phone>
  </Table2>
  <Table3>
    <SNO>1</SNO>
    <data>XYZ</data>
  </Table3>
  <Table3>
    <SNO>2</SNO>
    <data>SDF</data>
  </Table3>
</root>"""

root = ET.fromstring(xml)

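# Keep only the first child for each tag name.
# findall() returns a list, so removing from root while looping is safe.
seen = set()  # a set keeps the membership test fast with hundreds of tags
for elem in root.findall("./*"):
    if elem.tag not in seen:
        seen.add(elem.tag)
    else:
        root.remove(elem)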

# show result
ET.dump(root)

# write result to file (ET.indent requires Python 3.9+)
tree = ET.ElementTree(root)
ET.indent(tree, space='  ')
tree.write('reduced.xml', encoding="utf-8", xml_declaration=True)

Content of reduced.xml:

<?xml version='1.0' encoding='utf-8'?>
<root>
  <Table1>
    <ID>1</ID>
    <Name>ABC1</Name>
  </Table1>
  <Table2>
    <NEWID>1</NEWID>
    <phone>123</phone>
  </Table2>
  <Table3>
    <SNO>1</SNO>
    <data>XYZ</data>
  </Table3>
</root>
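
Because the real file is said to hold about a million rows, a streaming variant may be worth considering. The following is only a minimal sketch under assumptions: it uses ET.iterparse from the standard library so that rows are discarded as they are read, and it still assumes the file has a single root element. The function name first_of_each_tag and the sample data are illustrative, not taken from the question.

```python
import io
import xml.etree.ElementTree as ET


def first_of_each_tag(source):
    """Stream `source` and return a new root holding the first child per tag.

    `source` may be a file path or a binary file object.
    (Hypothetical helper name, chosen for this sketch.)
    """
    seen = set()      # tag names that have already been kept
    new_root = None   # reduced tree, built while parsing
    old_root = None   # root of the tree that iterparse itself builds
    depth = 0

    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if depth == 1:
                old_root = elem
                new_root = ET.Element(elem.tag, dict(elem.attrib))
            continue

        depth -= 1
        if depth == 1:  # a direct child of the root has just been closed
            if elem.tag not in seen:
                seen.add(elem.tag)
                new_root.append(elem)
            # Drop the parser's references so skipped rows can be freed.
            old_root.clear()

    return new_root


# Small in-memory demo; for a real file, pass its path instead.
sample = b"""<root>
  <Table1><ID>1</ID><Name>ABC1</Name></Table1>
  <Table1><ID>2</ID><Name>ABC2</Name></Table1>
  <Table2><NEWID>1</NEWID><phone>123</phone></Table2>
  <Table2><NEWID>2</NEWID><phone>12334</phone></Table2>
</root>"""

reduced = first_of_each_tag(io.BytesIO(sample))
ET.indent(reduced, space="  ")  # Python 3.9+
print(ET.tostring(reduced, encoding="unicode"))
```

The result can be written out the same way as above, by wrapping the returned element in ET.ElementTree and calling write(). Memory use should then depend mainly on the number of distinct table names rather than on the number of rows.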
Hermann12