
I'm having trouble using the unicodecsv reader. I keep looking for different examples of how to use the module, but everyone keeps referencing the exact sample from the unicodecsv website (or some similar variation).

import unicodecsv as csv
from io import BytesIO
f = BytesIO()
w = csv.writer(f, encoding='utf-8')
_ = w.writerow((u'é', u'ñ'))
_ = f.seek(0)
r = csv.reader(f, encoding='utf-8')
next(r) == [u'é', u'ñ']
# True

For me, this example makes too many assumptions about my understanding. It doesn't look like an actual csv file is being passed anywhere. I've completely missed the plot.

What I want to do is:

  1. Read the first line of the csv file, which contains the headers
  2. Read the remaining lines and put them in a dictionary

My broken code:

import unicodecsv
#
i = 0
myCSV = "$_input.csv"
dic = {}
#
f = open(myCSV, "rb")
reader = unicodecsv.reader(f, delimiter=',')
strHeader = reader.next()
#
# read the first line of csv
# use custom function to parse the header
myHeader = FNC.PARSE_HEADER(strHeader)
#
# read the remaining lines
# put data into dictionary of class objects
for row in reader:
    i += 1
    dic[i] = cDATA(myHeader, row)

And, as expected, I get the 'UnicodeDecodeError'. Maybe the example above has the answers, but they are just completely going over my head.

Can someone please fix my code? I'm running out of hair to pull out.

I switched the reader line to:

reader = unicodecsv.reader(f, encoding='utf-8')

The traceback:

  for row in reader:
  File "C:\Python27\unicodecsv\py2.py", line 128, in next
    for value in row]
UnicodeDecodeError: 'utf8' codec can't decode byte 0x90 in position 48: invalid start byte

When I simply print the data using:

f = open(myCSV, "rb")
reader = csv.reader(f, delimiter=',')
for row in reader:
    print(str(row[9]) + '\n')
    print(repr(row[9]) + '\n')

the output is:

UTAS ? Offline
'UTAS ? Offline'
twegner

2 Answers


You need to declare the encoding of the input file when creating the reader, just like you did when creating the writer:

>>> import unicodecsv as csv
>>> with open('example.csv', 'wb') as f:
...     writer = csv.writer(f, encoding='utf-8')
...     writer.writerow(('heading0', 'heading1'))
...     writer.writerow((u'é', u'ñ'))
...     writer.writerow((u'ŋ', u'ŧ'))
... 
>>> with open('example.csv', 'rb') as f:
...     reader = csv.reader(f, encoding='utf-8')
...     headers = next(reader)
...     print headers
...     data = {i: v for (i, v) in enumerate(reader)}
...     print data
... 
[u'heading0', u'heading1']
{0: [u'\xe9', u'\xf1'], 1: [u'\u014b', u'\u0167']}

Printing the dictionary shows the escaped representation of the data, but you can see the characters by printing them individually:

>>> for v in data.values():
...     for s in v:
...         print s
... 
é
ñ
ŋ
ŧ

EDIT:

If the encoding of the file is unknown, then it's best to use something like chardet to determine the encoding before processing.
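For example, a minimal sketch of that approach, assuming chardet is installed (the filename is a placeholder):

import chardet
import unicodecsv

# Read the raw bytes once so chardet can guess the encoding.
with open('input.csv', 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)              # e.g. {'encoding': 'Windows-1252', 'confidence': 0.9, ...}
encoding = guess['encoding'] or 'utf-8'  # fall back to utf-8 if detection fails entirely

# Re-open and parse with the detected encoding.
with open('input.csv', 'rb') as f:
    reader = unicodecsv.reader(f, delimiter=',', encoding=encoding)
    headers = next(reader)
    data = {i: row for i, row in enumerate(reader, start=1)}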

snakecharmerb
  • 1) You are showing the 'writer' then the 'reader'. Is the writer needed? Or is that only if someone is creating the csv file? In my case someone is sending me a csv file and I am processing the information. My code worked fine until one day one of the fields had Unicode characters added. 2) I tried adding encoding='utf-8' on the 'reader' line and it threw an error - something along the lines of not recognizing the input string for that parameter - I'm writing this from memory, I'm not at my workstation. – twegner Apr 04 '16 at 17:12
  • (1) The writer section is just for the purposes of the example. The reader code is independent of how the file is created (though it assumes a valid csv file encoded as utf-8). (2) Please edit your question with your new code and the full traceback when you get the opportunity. A sample of the "unicode" in your file might be helpful too. – snakecharmerb Apr 04 '16 at 18:05
  • Maybe your data isn't encoded as utf-8. Based on http://stackoverflow.com/questions/6180521/unicodedecodeerror-utf8-codec-cant-decode-bytes-in-position-3-6-invalid-dat, try changing the encoding to 'latin-1'. There are ISO-8859-X encodings for various languages that you could try; see https://en.wikipedia.org/wiki/ISO/IEC_8859-1 – snakecharmerb Apr 05 '16 at 15:17
  • Yes, I figured as much. Some of the data is international. It hasn't been a problem till now. I'm worried that if I choose a specific encoding it will work this time, but maybe not next time when the data/sources change. I want to share this code with others but I don't want it to be problematic. Is there a way to foolproof it against many different encodings? Perhaps a bunch of TRY statements? – twegner Apr 05 '16 at 16:16
  • In that case I think you need a proper tool. I've made a suggestion in my answer. I think that's as far as we can go with this question. Good luck! – snakecharmerb Apr 05 '16 at 16:28
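
A rough sketch of the "bunch of TRY statements" idea from the comment above - trying a few candidate encodings until one decodes cleanly; the function name and the encoding list are only illustrative:

import unicodecsv

def read_rows(path, encodings=('utf-8', 'cp1252', 'latin-1')):
    # Try each candidate encoding in turn and return the rows from the first
    # one that decodes without error. Note that latin-1 accepts any byte
    # sequence, so it acts as a last-resort catch-all here.
    for enc in encodings:
        try:
            with open(path, 'rb') as f:
                reader = unicodecsv.reader(f, delimiter=',', encoding=enc)
                return list(reader)
        except UnicodeDecodeError:
            continue  # wrong guess, try the next encoding
    raise ValueError('none of the candidate encodings worked for %s' % path)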

If your final goal is to read a csv file and convert the data into dicts, then I would recommend using csv.DictReader. DictReader will take care of reading the header and converting the rest of the rows into dicts (rowdicts). This uses the csv module, which has lots of documentation and examples available.

>>> import csv
>>> with open('names.csv') as csvfile:
...     reader = csv.DictReader(csvfile)
...     for row in reader:
...         print(row['first_name'], row['last_name'])

For more clarity, you can check the examples here: https://docs.python.org/2/library/csv.html#csv.DictReader
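
Since the file in the question contains non-ASCII data, the same pattern should also work with unicodecsv's DictReader, which mirrors the stdlib interface but accepts an encoding argument; a minimal sketch (the filename and column names are placeholders):

import unicodecsv

with open('names.csv', 'rb') as csvfile:                       # binary mode, as unicodecsv expects on Python 2
    reader = unicodecsv.DictReader(csvfile, encoding='utf-8')  # keys come from the header row
    for row in reader:
        print row['first_name'], row['last_name']              # values are unicode strings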

sam
  • The csv file contains Unicode/utf-8 characters, so I need to use the unicodecsv module, not the regular csv one. When I remove the Unicode fields from the csv file the code works fine. It's the Unicode and how to process it that escapes me. – twegner Apr 04 '16 at 16:52
  • Then let's convert this utf-8 to ascii format. If you are using Python 3, you are in luck: UTF-8 is now the standard format for Python 3. Otherwise we have lots of tools and methods to convert file encoding formats. Even Notepad++ can help you with it. Try it. Good luck. – sam Apr 06 '16 at 12:16
  • Also try checking this: http://stackoverflow.com/questions/904041/reading-a-utf8-csv-file-with-python – sam Apr 06 '16 at 12:16