
I have a problem reading unicode characters from a csv. The csv file originally had elements with unicode tags:

  1. "[u'Aeron\xe1utica']"
  2. "[u'Ni\u0161']"
  3. "[u'K\xfcnste']" ...

from which I had to remove the u'' tags to give a csv with

  1. Aeron\xe1utica
  2. Ni\u0161
  3. K\xfcnste ....

Now I want to read the csv and output it into a file with the characters i.e.

  1. Aeronáutica
  2. Niš
  3. Künste ....

I tried using the UnicodeWriter recipe from the csv docs, but it gives the same output as the second list (the escape sequences, not the characters).

Here's what I did to read and write:

import csv
import codecs

p = []
c = open('foo.csv', 'r')
r = csv.reader(c)
for row in r:
    p = p + row
c.close()
# The elements in p were ['Aeron\\xe1utica', 'Ni\\u0161', 'K\\xfcnste'...]
c = open('bar.csv', 'w')
c.write(codecs.BOM_UTF8)
writer = UnicodeWriter(c)  # the UnicodeWriter recipe from the csv docs
for row in p:
    writer.writerow([row])
c.close()

I also tried codecs.open('','','UTF-8') for both reading and writing, but it didn't help.

Martijn Pieters
KBhokray
  • No, you do *not* need to remove the `u`. Those are Unicode values, you *want* unicode values. – Martijn Pieters Jul 08 '13 at 11:43
  • And when reading a CSV with encoded characters, why not use the `UnicodeReader`? – Martijn Pieters Jul 08 '13 at 11:44
  • To clarify: Is `[u'Aeron\xe1utica']` the literal text inside your file - if not - what is? – Jon Clements Jul 08 '13 at 11:44
  • @MartijnPieters it was necessary for the job to remove them. I removed them in a spreadsheet. – KBhokray Jul 08 '13 at 11:53
  • @KBhokray: Then you did something wrong *creating* the spreadsheet. You are looking at the `repr()` string representation, a debugging aid. When turning a list to a string (when printing for example), all the contents are shown as `repr()` values, which is a `str` value that represents the actual contents of each element. – Martijn Pieters Jul 08 '13 at 11:54
  • @MartijnPieters Yeah, but this CSV is all I have now, it can't be changed. Also, UnicodeReader also gives the same results – KBhokray Jul 08 '13 at 12:17
  • @JonClements the _original CSV_ had very bad formatting. to give a single row, right out of it: [u'abc\u014d'],['$'],,['$'],['N/A'],['$'],"['', '', '', '', '']",['$'],"['', '', '', '', '']",['$'],['Not Available'],['@'] the ['$'] were supposed to help separate the values – KBhokray Jul 08 '13 at 12:19
  • @KBhokray: please add a few sample lines to your post. Use the 4-space indentation convention to format it (just like with code); the `{}` button on the toolbar can help with that. – Martijn Pieters Jul 08 '13 at 12:20

1 Answer


It appears you have written Python lists directly to your CSV file, resulting in the [...] literal syntax instead of normal columns. You then removed most of the information that could have been used to turn the data back into Python lists with unicode strings.

What you have left are Python unicode literals, but without the quotes. Use the unicode_escape codec to decode the values to Unicode again:

with open('foo.csv','r') as b0rken:
    for line in b0rken:
        value = line.rstrip('\r\n').decode('unicode_escape')
        print value

or add back the u'..' quoting, using a triple-quoted string in an attempt to avoid needing to escape embedded quotes:

from ast import literal_eval

with open('foo.csv','r') as b0rken:
    for line in b0rken:
        # wrap the raw text in u'''...''' so embedded single quotes survive
        value = literal_eval("u'''{}'''".format(line.rstrip('\r\n')))
        print value
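
The snippets in this answer are Python 2. As a rough sketch of the same two approaches under Python 3 (where `str` has no `.decode()` method), the following compares them on the sample values from the question. The helper names `decode_with_literal` and `decode_with_codec` and the `caf\xe9'` test value are made up for illustration, and the codec variant assumes the input is pure ASCII, as the escaped values here are.

```python
from ast import literal_eval


def decode_with_literal(text):
    """Wrap the text in a triple-quoted literal and evaluate it."""
    return literal_eval("u'''{}'''".format(text))


def decode_with_codec(text):
    """Decode backslash escapes with unicode_escape (ASCII-only input assumed)."""
    return text.encode("ascii").decode("unicode_escape")


# Raw strings, so the backslashes are literal, as they are in the file.
samples = [r"Aeron\xe1utica", r"Ni\u0161", r"K\xfcnste", r"\xc9cole de l'Air"]
for raw in samples:
    # Both approaches agree on these values, including the embedded apostrophe.
    assert decode_with_literal(raw) == decode_with_codec(raw)

# A value that ends in a single quote breaks the triple-quoted literal,
# because the closing ''' is then followed by a stray quote.
trailing_quote = r"caf\xe9'"
try:
    decode_with_literal(trailing_quote)
except SyntaxError:
    literal_failed = True
else:
    literal_failed = False
```

This is why the triple-quoted string is only an attempt to avoid escaping: the codec route does not care where the quotes fall.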

If you still have the original file (with the [u'...'] formatted lines), use the ast.literal_eval() function to turn those back into Python lists. No point in using the CSV module here:

from ast import literal_eval

with open('foo.csv','r') as b0rken:
    for line in b0rken:
        lis = literal_eval(line)
        value = lis[0]
        print value
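
For the multi-column row quoted in the comments, one possible variation (an assumption, not part of the approach above) is to let the csv module split the row first and then run `literal_eval` on each cell. The sketch below is Python 3, uses a shortened version of that row, and the name `parse_row` is hypothetical. It would split incorrectly if an unquoted cell contained a comma inside its list.

```python
import csv
from ast import literal_eval


def parse_row(line):
    """Split one CSV line, then evaluate each non-empty cell as a Python literal."""
    cells = next(csv.reader([line]))
    return [literal_eval(cell) if cell else None for cell in cells]


# Shortened form of the sample row from the comments; the raw string keeps
# the backslash escape exactly as it would appear in the file.
sample = r'''[u'abc\u014d'],['$'],,"['', '']",['N/A']'''
parsed = parse_row(sample)
first_value = parsed[0][0]  # the decoded unicode string from the first column
```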

Demo with unicode_escape:

>>> for line in b0rken:
...     print line.rstrip('\r\n').decode('unicode_escape')
... 
Aeronáutica
Niš
Künste
École de l'Air
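
Since the question asks for the decoded values to end up in a file, here is a minimal Python 3 sketch of the whole round trip using the unicode_escape approach. The function name `convert_escaped_file` is hypothetical, and it assumes one escaped value per line and an ASCII-only input file.

```python
def convert_escaped_file(src_path, dst_path):
    """Read lines containing backslash escapes and write them out as UTF-8 text."""
    with open(src_path, "r", encoding="ascii") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            raw = line.rstrip("\r\n")
            # bytes.decode('unicode_escape') turns \xe1 or \u0161 into characters
            value = raw.encode("ascii").decode("unicode_escape")
            dst.write(value + "\n")


# Example call, using the file names from the question:
# convert_escaped_file('foo.csv', 'bar.csv')
```

If the output needs a byte order mark, as the `codecs.BOM_UTF8` write in the question suggests, opening the destination with `encoding="utf-8-sig"` instead should emit one.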
Martijn Pieters
  • Thanks, _almost there_, but for a little issue. Some elements already have a ' character in them, e.g. \xc9cole de l'Air. This gives the error: File "", line 1 u'\xc9cole de l'Air' ^ SyntaxError: invalid syntax. Of course, I can substitute it easily to get past it, but a direct method would be more helpful – KBhokray Jul 08 '13 at 12:33
  • @KBhokray: just use the `unicode_escape` approach or use a triple-quoted string. – Martijn Pieters Jul 08 '13 at 12:36