2

I am using this Python script to convert CSV to XML. After conversion I see tags in the text (viewed in vim), which cause an XML parsing error.


I have already tried the answers from here, without success.

The converted XML file.

Thanks for any help!

Adrian
  • The simple answer is not to use that Python script, which doesn't actually know how to write valid XML. Instead of trying to fix something that builds bad output, use something that does the job the right way in the first place. – Charles Duffy Mar 12 '14 at 23:50
  • Hmmm! I would appreciate any other suggestion for a universal CLI csv2xml converter. :) – Adrian Mar 12 '14 at 23:52
  • there's no such thing (and what you linked to isn't one either), because there exists no single, universal way to represent a tabular syntax in a structured language. That tool you pointed at makes a bunch of assumptions about what the output should look like; there's nothing "universal" about it. That said, given clarifications about just what that output should be, pretty much any competent developer could write such a tool in the space of five minutes. – Charles Duffy Mar 12 '14 at 23:55

3 Answers

10

Your input file starts with a BOM (byte-order mark), and Python doesn't strip it automatically when the file is decoded as utf8. Decoding with utf-8-sig removes it. See: Reading Unicode file data with BOM chars in Python

>>> s = '\xef\xbb\xbfABC'
>>> s.decode('utf8')
u'\ufeffABC'
>>> s.decode('utf-8-sig')
u'ABC'

So for your specific case, try something like

import csv
from io import StringIO

# Python 2 only: utf-8-sig drops the leading BOM while decoding.
s = StringIO(open(csvFile).read().decode('utf-8-sig'))
csvData = csv.reader(s)

This is terrible style, but that script is a hacked-together one-shot job anyway.
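For Python 3, where decoding to text and the csv module work together on unicode strings, the same idea can be sketched as below. This is only an illustration under assumptions: the sample bytes and the `read_rows` helper are made up for the demo and are not part of the original script.

```python
import csv
import io

# Hypothetical sample: UTF-8 encoded CSV bytes that start with a BOM.
raw = b'\xef\xbb\xbfname,price\ncoffee,\xe2\x82\xac3\n'

def read_rows(data, encoding):
    """Decode CSV bytes with the given codec and return all rows as lists."""
    text = data.decode(encoding)
    # newline='' lets the csv module handle line endings itself.
    return list(csv.reader(io.StringIO(text, newline='')))

rows_plain = read_rows(raw, 'utf-8')     # BOM survives as '\ufeff' in the first cell
rows_sig = read_rows(raw, 'utf-8-sig')   # BOM is dropped during decoding

print(repr(rows_plain[0][0]))
print(repr(rows_sig[0][0]))
```

Comparing the two first cells shows where the stray character in the converted output comes from: with plain utf-8 it is carried into the first field, and from there into the XML.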

metatoaster
  • Thanks for the help! I replaced the related part, but I got this error: File "x.py", line 26, in for row in csvData: UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 133: ordinal not in range(128) – Adrian Mar 13 '14 at 18:10
  • Which Python version? In version 3 the open/read method should already produce a unicode string, so all that is left is to strip off the leading BOM character; you could just do `s = s[1:]`, which will work. The example code I provided is meant to show what might be going on in the background. Learn to look for what is actually being done and try to understand the logic rather than blindly following solutions. – metatoaster Mar 13 '14 at 20:25
8

Change utf-8 to utf-8-sig

import csv
with open('example.txt', 'r', encoding='utf-8-sig', newline='') as file:
    for row in csv.reader(file):  # the BOM has already been stripped here
        print(row)
RedCarrot
0

Here's an example of a script that uses a real XML-aware library to run a similar conversion. It doesn't have the exact same output, but, well, it's an example -- salt to taste.

import csv
import lxml.etree

csvFile = 'myData.csv'
xmlFile = 'myData.xml'

reader = csv.reader(open(csvFile, 'r', encoding='utf-8-sig', newline=''))  # Python 3; utf-8-sig drops a leading BOM
with lxml.etree.xmlfile(xmlFile) as xf:
  xf.write_declaration(standalone=True)
  with xf.element('root'):
    for row in reader:
      row_el = lxml.etree.Element('row')
      for col in row:
        col_el = lxml.etree.SubElement(row_el, 'col')
        col_el.text = col
      xf.write(row_el)

To refer to the content of, say, row 2 column 3, you'd then use XPath like /root/row[2]/col[3]/text(), since the rows are nested inside the root element.
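If lxml is not available, a rough equivalent can be tried with the standard library's xml.etree.ElementTree, which supports a subset of XPath. The sketch below is an assumption-laden illustration, not the script above: the sample CSV text is invented, the document is built in memory instead of streamed to a file, and the lookup path is written relative to the root element.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data; the last cell contains characters that must be escaped in XML.
csv_text = 'a,b,c\n1,2,x < y & z\n'

root = ET.Element('root')
for row in csv.reader(io.StringIO(csv_text, newline='')):
    row_el = ET.SubElement(root, 'row')
    for col in row:
        ET.SubElement(row_el, 'col').text = col

# The library escapes '<' and '&' while serializing, so the output stays well-formed.
xml_text = ET.tostring(root, encoding='unicode')

# Row 2, column 3 (positions are 1-based), looked up relative to the root element.
cell = ET.fromstring(xml_text).find('row[2]/col[3]').text
print(xml_text)
print(cell)
```

Parsing the serialized text back and reading the cell returns the original string, which is the property a hand-rolled string-concatenation converter tends to lose.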

Charles Duffy