
In a nutshell, my problem is that my script cannot write complete Unicode strings (retrieved from a DB) to a CSV file; instead, only the first character of each string is written, e.g.:

U,1423.0,831,1,139

Where the output should be:

University of Washington Students,1423.0,831,1,139

Some background: I'm connecting to an MS SQL Server database using pyodbc. I have my ODBC config file set up for Unicode, and connect to the DB as follows:

p.connect("DSN=myserver;UID=username;PWD=password;DATABASE=mydb;CHARSET=utf-8")

I can get data with no problem, but the issue arises when I try to save query results to the CSV file. I've tried csv.writer, the UnicodeWriter recipe from the official docs, and most recently the unicodecsv module I found on GitHub. Each method yields the same result.

The weird thing is that I can print the strings in the Python console with no problem. Yet if I take the same string and write it to CSV, the problem emerges. See my test code and results below:

Code to highlight issue:

from StringIO import StringIO
import unicodecsv

print "'Raw' string from database:"
print "\tencoding:\t" + whatisthis(report.data[1][0])
print "\tprint string:\t" + report.data[1][0]
print "\tstring len:\t" + str(len(report.data[1][0]))

f = StringIO()
w = unicodecsv.writer(f, encoding='utf-8')
w.writerows(report.data)
f.seek(0)
r = unicodecsv.reader(f)
row = r.next()
row = r.next()

print "Write/Read from csv file:"
print "\tencoding:\t" + whatisthis(row[0])
print "\tprint string:\t" + row[0]
print "\tstring len:\t" + str(len(row[0]))

Output from test:

'Raw' string from database:
    encoding: unicode string
    print string: University of Washington Students
    string len: 66
Write/Read from csv file:
    encoding: unicode string
    print string: U
    string len: 1

What could be the reason for this issue and how might I resolve it? Thanks!

EDIT: the whatisthis function just checks the string type; it's taken from this post:

def whatisthis(s):
    # returns (rather than prints) so the result can be concatenated above
    if isinstance(s, str):
        return "ordinary string"
    elif isinstance(s, unicode):
        return "unicode string"
    else:
        return "not a string"
    Why are you doing r.next() twice? Aren't you printing the 2nd row in the first case and 3rd row in the 2nd case? – Atmaram Shetye Jun 30 '13 at 20:01
  • That's just because the first row is the field names, which I added afterwards. I just did the r.next() twice to highlight the problem. – Andrew Hilts Jun 30 '13 at 20:26
  • I am having a similar problem while trying to load csv data to mssql using csvsql. All the textual fields are loaded as a single char. Any idea what it might be? What should be the file's encoding? (driver is freetds on Ubuntu) – Zach Moshe Oct 19 '14 at 15:31

1 Answer

import StringIO as sio
import unicodecsv as ucsv

class Report(object):
    def __init__(self, data):
        self.data = data

report = Report( 
  [
     ["University of Washington Students", 1, 2, 3],
     ["UCLA", 5, 6, 7]
  ]
)



print report.data
print report.data[0][0]

print "*" * 20

f = sio.StringIO()
writer = ucsv.writer(f, encoding='utf-8')
writer.writerows(report.data)

print f.getvalue()
print "-" * 20

f.seek(0)

reader = ucsv.reader(f)
row = reader.next()

print row
print row[0]



--output:--
[['University of Washington Students', 1, 2, 3], ['UCLA', 5, 6, 7]]
University of Washington Students
********************
University of Washington Students,1,2,3
UCLA,5,6,7

--------------------
[u'University of Washington Students', u'1', u'2', u'3']
University of Washington Students

Who knows what mischief your whatisthis() function is up to.

7stud
  • If I create the list myself in code, as above, then there's no problem. The issue is due to the unicode data I'm pulling from the DB. Please see my edit that explains the whatisthis function. – Andrew Hilts Jun 30 '13 at 20:37
  • Why did you even include the whatis() function in your code? How is it relevant? What output do you get for: print type(report.data)? – 7stud Jun 30 '13 at 20:59
  • And what do you get for: print report.data? – 7stud Jun 30 '13 at 21:07
  • Here is the output when i print report.data: [(u'U\x00n\x00i\x00v\x00e\x00r\x00s\x00i\x00t\x00y\x00 \x00o\x00f\x00 \x00W\x00a\x00s\x00h\x00i\x00n\x00g\x00t\x00o\x00n\x00 \x00S\x00t\x00u\x00d\x00e\x00n\x00t\x00s\x00', u'2\x000\x001\x003\x00-\x000\x006\x00-\x002\x009\x00', 1423.0, 831, 1, 139)] – Andrew Hilts Jun 30 '13 at 21:08
  • Your database data is not encoded in UTF-8. For each character in report.data, there are two bytes: `U\x00, n\x00, etc.` The second byte is \x00 for each character, an empty byte, because your characters all fit in the first byte. In other words, your database encoding uses a fixed two bytes per character. The UTF-8 encoding is different. Looking at the MSSQL docs, it looks like the encoding is UCS-2, which is a fixed two bytes per character. Python doesn't have a ucs2 codec to decode such strings, but you can try "utf_16_le". – 7stud Jun 30 '13 at 22:33
  • I am experiencing a similar problem where my text data fields get truncated to a single char when I load to the db using csvkit. Any idea what I can do? Please see my [SE question](http://stackoverflow.com/questions/26361415/text-column-get-truncated-to-1-char-when-loading-with-csvkit-to-mssql) – Amnon Oct 21 '14 at 09:18
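Building on 7stud's last comment, here is a minimal sketch of one way to repair such strings, assuming the data really is little-endian two-byte text as the repr() in the comments suggests (the `fix_ucs2` helper name is mine, not part of any library). It also explains the "string len: 66" in the question: 33 real characters plus 33 interleaved NULs. Re-encoding with latin-1 recovers the original bytes unchanged, which can then be decoded properly as UTF-16-LE:

```python
def fix_ucs2(value):
    # Only string values need fixing; numbers pass through untouched.
    if isinstance(value, type(u'')):
        # latin-1 maps code points 0-255 back to the same raw bytes,
        # which we then decode as little-endian UTF-16.
        return value.encode('latin-1').decode('utf_16_le')
    return value

# Sample row mimicking the repr() shown in the comments above.
row = (u'U\x00n\x00i\x00v\x00e\x00r\x00s\x00i\x00t\x00y\x00', 1423.0, 831)
fixed_row = [fix_ucs2(v) for v in row]
# fixed_row[0] is now u'University'; the NULs are gone and len is halved.
```

Applying `fix_ucs2` to every value in `report.data` before handing the rows to the csv writer should be enough; alternatively, fixing the driver/DSN configuration so the strings arrive correctly decoded avoids the workaround entirely.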