I want to read a CSV file and replace the matching tags in an XML file with the values from the second column of the CSV file. The values of the 'name' tags are in the first column.

A         |    B

Value1    |    ValueX
Value2    |    ValueX
Value3    |    ValueY

The XML structure looks like this:

<products>
   <product>
      <name>Value1</name>
   </product>
   <product>
      <name>Value2</name>
   </product>
   <product>
      <name>Value3</name>
   </product>
</products>

Python code

import csv 
import collections
import xml.etree.ElementTree
tree = xml.etree.ElementTree.parse("jolly.xml").getroot()

with open('file.csv', 'r') as f:
    reader = csv.DictReader(f)# read rows into a dictionary format
    reader = csv.reader(f, dialect=csv.excel_tab)
    list = list(reader)
    columns = collections.defaultdict(list)# each value in each column is appended to a list

for (k, v) in row.items(): #go over each column name and value
    columns[k].append(v)# append the value into the appropriate list

print columns['A']
print columns['B']
for elem in tree.findall('.//name'):
    if elem.attrib['name'] == columns['A']:
        elem.attrib['name'] == columns['B']

How can I handle it?

Here is what the CSV file looks like: [image not included]

Reading the CSV file looks like this: [image not included]

The output should look like this:

Value1 should be replaced with ValueX

Ok here is my solution:

import lxml.etree as ET

arr = ["Value1", "Value2", "Value3"]
arr2 = ["ValueX", "ValueX", "ValueY"]

with open('file.xml', 'rb+') as f:
    tree = ET.parse(f)
    root = tree.getroot()
    for i, item in enumerate(arr):
        # 'name' is the tag; Value1 etc. are the text inside it
        for elem in root.findall('.//name'):
            if elem.text == item:
                print(item, '->', arr2[i])
                elem.text = arr2[i]

    # Overwrite the file in place with the updated tree
    f.seek(0)
    f.write(ET.tostring(tree, encoding='UTF-8', xml_declaration=True))
    f.truncate()

For now I am using hard-coded arrays; I can just copy the values from the file into them. For huge files this needs better code.
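One way to avoid the hard-coded arrays is to read the CSV into a dictionary and look up each 'name' value once. The following is a minimal sketch using only the standard library (`csv` and `xml.etree.ElementTree`). It assumes a comma-delimited file with no header row; the helper names `load_mapping` and `replace_names` are made up for illustration, and inline sample data stands in for the real files.

```python
import csv
import io
import xml.etree.ElementTree as ET


def load_mapping(csv_file, delimiter=","):
    """Return {column A value: column B value} from an open CSV file."""
    mapping = {}
    for row in csv.reader(csv_file, delimiter=delimiter):
        if len(row) >= 2:  # skip blank or incomplete rows
            mapping[row[0].strip()] = row[1].strip()
    return mapping


def replace_names(root, mapping):
    """Replace the text of each <name> element that appears in the mapping."""
    replaced = 0
    for elem in root.iter("name"):
        if elem.text in mapping:
            elem.text = mapping[elem.text]
            replaced += 1
    return replaced


# Inline sample data stands in for file.csv and the XML file (assumption).
csv_text = "Value1,ValueX\nValue2,ValueX\nValue3,ValueY\n"
xml_text = (
    "<products>"
    "<product><name>Value1</name></product>"
    "<product><name>Value2</name></product>"
    "<product><name>Value3</name></product>"
    "</products>"
)

mapping = load_mapping(io.StringIO(csv_text))
root = ET.fromstring(xml_text)
count = replace_names(root, mapping)
print(count, [e.text for e in root.iter("name")])
```

With real files, the same helpers could be fed from `open('file.csv', newline='')` and `ET.parse('file.xml')`, and the result saved with `tree.write('final.xml', encoding='UTF-8', xml_declaration=True)`. Because each element is looked up once, a replaced value is never replaced a second time, and the XML is traversed only once regardless of how many CSV rows there are.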

Tony

1 Answer

Consider using XSLT, the special-purpose, declarative language designed to restructure XML files. Like most other general-purpose languages, including ASP, C#, Java, PHP, Perl, and VB, Python can run XSLT 1.0, specifically through the third-party lxml module.

For your purposes, you can dynamically build an XSLT string and use it for the transformation. The only loop needed is the one over the CSV data:

import csv
import lxml.etree as ET

# READ IN CSV DATA AND APPEND TO LIST
csvdata = []
with open('file.csv', 'r') as csvfile:
    readCSV = csv.reader(csvfile)
    for line in readCSV:
        csvdata.append(line)

# DYNAMICALLY CREATE XSLT STRING
xsltstr = '''<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
            <xsl:output version="1.0" encoding="UTF-8" indent="yes" />
            <xsl:strip-space elements="*"/>

              <!-- Identity Transform -->
              <xsl:template match="@*|node()">
                <xsl:copy>
                  <xsl:apply-templates select="@*|node()"/>
                </xsl:copy>
              </xsl:template>

        '''

for i in range(len(csvdata)):
    xsltstr = xsltstr + \
              '''<xsl:template match="name[.='{0}']">
                  <xsl:element name="{1}">
                     <xsl:apply-templates />
                  </xsl:element>
              </xsl:template>

              '''.format(*csvdata[i])

xsltstr = xsltstr + '</xsl:transform>'

# PARSE ORIGINAL FILE AND XSLT STRING
dom = ET.parse('jolly.xml')
xslt = ET.fromstring(xsltstr)

# TRANSFORM XML
transform = ET.XSLT(xslt)
newdom = transform(dom)

# OUTPUT FINAL XML (PRETTY PRINT)
tree_out = ET.tostring(newdom, encoding='UTF-8', pretty_print=True, xml_declaration=True)

with open('final.xml', 'wb') as xmlfile:
    xmlfile.write(tree_out)

OUTPUT

<?xml version='1.0' encoding='UTF-8'?>
<products>
  <product>
    <ValueX>Value1</ValueX>
  </product>
  <product>
    <ValueY>Value2</ValueY>
  </product>
  <product>
    <ValueZ>Value3</ValueZ>
  </product>
</products>
Parfait
  • Hi, many thanks for your help. I am getting the following error: Traceback (most recent call last): File "readCSVReplaceTags.py", line 2, in import lxml.etree as ET ImportError: No module named lxml.etree. I have installed lxml but it's not working. Is there another module I can use the same way? – Tony Dec 30 '15 at 21:34
  • You do not have lxml installed. *I have installed lxml but its not working*? Try reinstalling `pip install lxml` and you also need `libxml2-dev` and `libxslt1-dev`. See [SO post](http://stackoverflow.com/questions/5178416/pip-install-lxml-error) – Parfait Dec 30 '15 at 22:15
  • I am using Mac OS X 10.11 – Tony Dec 30 '15 at 22:16
  • Traceback (most recent call last): File "readCSVReplaceTags.py", line 11, in for line in readCSV: File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/codecs.py", line 321, in decode (result, consumed) = self._buffer_decode(data, self.errors, final) UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdf in position 33: invalid continuation byte – Tony Dec 30 '15 at 23:17
  • You have special characters in your csv file (accents, foreign-language characters, etc.), which may require specifying an encoding. Post actual data so we can see. Do note that XML tag names cannot contain spaces or begin with a number, so check Column B. – Parfait Dec 30 '15 at 23:43
  • Yup, I have ü, ä, &, etc. How can I decode them within the code? – Tony Dec 30 '15 at 23:46
  • Also getting this error: File "readCSVReplaceTags.py", line 36, in '''.format(*csvdata[i]) IndexError: tuple index out of range – Tony Dec 31 '15 at 00:25
  • If you failed to import, csvdata will be empty. And even if you import into Python, you will run into encoding issues in XML. Specify the encoding type in [`open()`](https://docs.python.org/3/howto/unicode.html) and in `ET.tostring()`. From your posted example data, I assure you this answer works. – Parfait Dec 31 '15 at 00:50
  • Nope, it's not working. File "readCSVReplaceTags.py", line 33, in '''.format(*csvdata[i]) IndexError: tuple index out of range. If I remove .format() then it works without utf-8. If I add utf-8 then I get: Traceback (most recent call last): File "readCSVReplaceTags.py", line 6, in with open('file.csv', 'r', 'utf-8') as csvfile: TypeError: an integer is required (got type str) – Tony Dec 31 '15 at 00:55
  • Please post a snippet of actual data. There's something about the data triggering these issues. Any character `[A-Za-z0-9]` csv data should work. I can only speculate at this point. – Parfait Dec 31 '15 at 01:13
  • See my first post! Thanks – Tony Dec 31 '15 at 01:32
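The errors in the comments above point at three separate issues, which the sketch below tries to address in one place. First, the `TypeError` arises because the third positional argument of `open()` is `buffering`, so the encoding has to be passed as `encoding=...`. Second, byte 0xdf is "ß" in Latin-1, so the file is probably Latin-1 or cp1252 rather than UTF-8 (a guess; the real encoding is not stated). Third, the `IndexError` from `.format(*csvdata[i])` occurs when a row has fewer than two fields. The helper name `read_pairs`, the default encoding, and the ASCII-only tag-name check are all assumptions made for illustration.

```python
import csv
import os
import re
import tempfile

# Conservative ASCII-only check for an XML element name (assumption:
# real XML allows more characters, such as accented letters).
TAG_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_.-]*$")


def read_pairs(path, encoding="latin-1"):
    """Read (old, new) pairs from a CSV file, skipping unusable rows."""
    pairs = []
    # encoding must be passed by keyword: the third positional
    # argument of open() is buffering, which expects an integer.
    with open(path, "r", encoding=encoding, newline="") as csvfile:
        for row in csv.reader(csvfile):
            if len(row) < 2:
                continue  # would otherwise raise IndexError in str.format
            old, new = row[0].strip(), row[1].strip()
            if TAG_NAME.match(new):  # skip values that cannot be tag names
                pairs.append((old, new))
    return pairs


# Demo: a temporary Latin-1 file containing "ß" (byte 0xdf), a blank line,
# a one-column row, and a row whose second column is not a valid tag name.
sample = "Stra\u00dfe,ValueX\n\nonly_one_column\nValue2,9 bad tag\n"
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(sample.encode("latin-1"))
    demo_path = tmp.name

pairs = read_pairs(demo_path)
os.remove(demo_path)
print(pairs)
```

Only the first row survives the checks here. In practice it may be better to log or report the skipped rows rather than drop them silently, so that bad data in Column B is noticed.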