0

I have a large Pandas dataframe (over 2 millions rows), with the following columns:

Id,CandidateRegistrationID,CandidateID,OurReference,QualificationCode,ExamCode,ExamDate,QualificationName,DataSource,QuestionNo,CandidateResponse,CorrectAnswerChoice,UniquePaperNo,QuestionCode

I have a function to write the dataframe to sqlite:

def writeDF(df,db,table):
    conn = sqlite3.connect(db)
    conn.text_factory = str  # allows utf-8 data to be stored

    df.to_sql(table, conn, flavor='sqlite', schema=None, if_exists='replace', index=False, index_label=None, chunksize=None, dtype=None)

    conn.close()

On cut-down versions of the data this works fine. On the full data-set I get the following error:

ValueError: Cannot convert identifier to UTF-8: 'Id'

The Id field is just an integer.

I'd welcome any insights. Googling just takes me to the line in Pandas that makes the error.

Traceback (most recent call last):
File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/flask/app.py", line 1836, in __call__
  return self.wsgi_app(environ, start_response)
File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/flask/app.py", line 1820, in wsgi_app
  response = self.make_response(self.handle_exception(e))
File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/flask/app.py", line 1403, in handle_exception
  reraise(exc_type, exc_value, tb)
File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/flask/app.py", line 1817, in wsgi_app
  response = self.full_dispatch_request()
File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
  rv = self.handle_user_exception(e)
File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
  reraise(exc_type, exc_value, tb)
File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
  rv = self.dispatch_request()
File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/flask/app.py", line 1461, in dispatch_request
  return self.view_functions[rule.endpoint](**req.view_args)
File "/py-csv-jmetrik/app/routes.py", line 69, in index
  writeDF(data_df,db,table)
File "/py-csv-jmetrik/app/routes.py", line 27, in writeDF
  df.to_sql(table, conn, flavor='sqlite', schema=None, if_exists='replace', index=True, index_label=None, chunksize=None, dtype=None)
File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/pandas/core/generic.py", line 982, in to_sql
  dtype=dtype)
File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/pandas/io/sql.py", line 549, in to_sql
  chunksize=chunksize, dtype=dtype)
File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/pandas/io/sql.py", line 1565, in to_sql
  dtype=dtype)
File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/pandas/io/sql.py", line 627, in __init__
  self.table = self._create_table_setup()
File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/pandas/io/sql.py", line 1377, in _create_table_setup
  for cname, ctype, _ in column_names_and_types]
File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/pandas/io/sql.py", line 1297, in _get_valid_sqlite_name
  uname = _get_unicode_name(name)
File "/py-csv-jmetrik/venv/lib/python2.7/site-packages/pandas/io/sql.py", line 1271, in _get_unicode_name
  raise ValueError("Cannot convert identifier to UTF-8: '%s'" % name)
Leb
  • 15,483
  • 10
  • 56
  • 75
  • The http://stackoverflow.com/questions/3425320/sqlite3-programmingerror-you-must-not-use-8-bit-bytestrings-unless-you-use-a-te question looks similar to this, although I don't think it's a duplicate. But its answer (use `sqlite3.Binary` as the `text_factory`) might be relevant to the OP's problem. Worth a look. – rmunn Oct 08 '15 at 14:30
  • @rmunn I'm not sure about that, as http://stackoverflow.com/questions/3425320/sqlite3-programmingerror-you-must-not-use-8-bit-bytestrings-unless-you-use-a-te is referring to storing blobs. All I'm doing is storing text strings and numbers. – Mark Berthelemy Oct 09 '15 at 08:06
  • Nope, you're right, that's not the problem you're encountering. What's happening to you is that the *column name* "Id" is failing to convert to Unicode. Which is weird: when I look at the source code to `pandas/io/sql.py`, I don't see any reason why it should fail. – rmunn Oct 12 '15 at 03:07

2 Answers2

1

Maybe it's a little bit late for your problem but could help other programmers, I had exact the same issue as you had, I was reading a csv using pandas and then trying to insert on a bd sqlite.

The solution that worked for me was declaring the enconding keyword when reading csv file:

import pandas as pd
import sqlite3
conn = sqlite3.connect('dbpath', isolation_level= None, check_same_thread=False)
df = pd.read_csv('csvfile.csv', encoding = "UTF-8-sig")
df.to_sql('tablename', con=conn, if_exists= 'append')
0

At first glance, I don't see any reason why this should be happening. This is the function within Pandas that's throwing the error you're seeing:

def _get_unicode_name(name):                                                    
    try:                                                                        
        uname = name.encode("utf-8", "strict").decode("utf-8")                  
    except UnicodeError:                                                        
        raise ValueError("Cannot convert identifier to UTF-8: '%s'" % name)     
    return uname                                                                

The only way that would fail is if either encoding the string "Id" to UTF-8 failed, or if decoding that UTF-8 string failed. And there's nothing I can see about the name "Id" that should be causing a failure.

Try this. Since you're using Python, an interpreted language, take advantage of that fact and edit the source code of the library you're using. Edit /py-csv-jmetrik/venv/lib/python2.7/site-packages/pandas/io/sql.py and change the function above to:

def _get_unicode_name(name):                                                    
    try:                                                                        
        utf8name = name.encode("utf-8", "strict")
    except UnicodeError:                                                        
        raise ValueError("Cannot encode identifier to UTF-8: '%s'" % utf8name)     
    try:                                                                        
        uname = utf8name.decode("utf-8")                  
    except UnicodeError:                                                        
        raise ValueError("Cannot decode UTF-8: '%s'" % utf8name)     
    return uname                                                                

That will at least tell you which of the two operations is failing. Then run your program as follows:

python myscript.py >stdout.txt 2>stderr.txt

Then look at stderr.txt through a microscope (i.e., pipe the last couple lines or the first couple lines to xxd) to see what character values ended up in it:

head -n 2 stderr.txt | xxd
tail -n 2 stderr.txt | xxd

What you want to do is capture the line with the ValueError where it gives you the name of the identifier that's causing the error (in this case, "Id"). See if there are any weird characters in your "Id" identifier, like zero-width spaces or anything like that. That's the only thing I can think of right now. It might not help, but at least it will narrow down the problem a little bit... maybe.

rmunn
  • 34,942
  • 10
  • 74
  • 105