I'm trying to insert into a table, but it seems that the file I opened has non-ascii characters in it. This is the error I got:

sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

So after doing some research, I tried putting this in my code:

encode("utf8","ignore")

Which then gave me this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x92 in position 9: ordinal not in range(128)

So then I tried using the codecs library and opening the file like this:

codecs.open(fileName, encoding='utf-8')

which gave me this error:

newchars, decodedbytes = self.decode(data, self.errors)

UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 0: invalid start byte

Then instead of utf-8, I used utf-16 to see if that would do anything and I got this error:

raise UnicodeError,"UTF-16 stream does not start with BOM"

UnicodeError: UTF-16 stream does not start with BOM

I'm all out of ideas... Also I'm using Ubuntu, if it helps.
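
For reference, the decode failures quoted above can be reproduced with a single byte string. This is only a sketch: the sample bytes are hypothetical, and `cp1252` (Windows-1252, where byte 0x92 is a right single quotation mark) is an assumed candidate, not something the original file is known to use. It is written for Python 3.

```python
# Hypothetical sample: a byte string containing the offending byte 0x92.
raw = b"It\x92s"


def try_decode(data, encoding):
    """Return the decoded text, or the UnicodeDecodeError message on failure."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "failed: %s" % exc


# ascii and utf-8 both reject 0x92; cp1252 (an assumed candidate) accepts it.
results = {name: try_decode(raw, name) for name in ("ascii", "utf-8", "cp1252")}
for name, outcome in sorted(results.items()):
    print(name, "->", repr(outcome))
```

A successful decode only shows that the bytes are valid in that encoding, not that it is the right one, so the result still needs a visual check.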

wpakt
  • You need to know what encoding the file you're opening is. – Benjamin Peterson May 13 '13 at 19:36
  • The first `UnicodeDecodeError` is thrown because you are trying to encode bytes, which requires you to *decode* to Unicode first. – Martijn Pieters May 13 '13 at 19:41
  • It says: text/plain; charset=unknown-8bit – wpakt May 13 '13 at 19:43
  • You can use https://pypi.python.org/pypi/chardet to guess the encoding. – Thomas Fenzl May 13 '13 at 19:46
  • @MartijnPieters Do I have to turn it into a unicode string first like so: http://stackoverflow.com/a/1211102/2379053 in order to ignore the unicode characters? Is python, by default, just reading in the strings as if they are ascii? – wpakt May 13 '13 at 19:52
  • @user2379053: You really want to read the [Python Unicode HOWTO](http://docs.python.org/2/howto/unicode.html); python reads byte strings: characters with values between 0 and 255, regardless of encoding. – Martijn Pieters May 13 '13 at 19:53
  • @user2379053: You can then interpret those bytes as a specific encoding by using `.decode()` to get unicode values. You shouldn't ignore anything to get there, that's like using a chainsaw to make your shiny car fit into a garage without opening the door first. – Martijn Pieters May 13 '13 at 19:54
  • @user2379053: Instead, figure out the correct encoding, treat it like a key to open the garage door first. – Martijn Pieters May 13 '13 at 19:55
  • @user2379053: What you linked to is going the *other way*, unicode encoding to byte strings. That's not the direction you want to go here. – Martijn Pieters May 13 '13 at 19:56
  • Since stack overflow won't let me answer my own question yet... This is what happened: The problem was that the file doesn't know what encoding it is. I used: **file -bi [filename]** to find out what encoding the file is and got: **text/plain; charset=unknown-8bit**. So I went into my text editor (Sublime) to see if it would work if I saved it with encoding: utf-8. Then I ran my script (with the codecs library) using that file and it worked. Thanks for everyone's help. :) – wpakt May 13 '13 at 20:35
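
The advice in the comments (find the real encoding, decode to Unicode while reading, then pass Unicode strings to sqlite3) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the asker's actual script: it assumes the file is Windows-1252 (`cp1252`), where byte 0x92 is a right single quotation mark, and the table name `notes`, the helper names, and the sample file are all hypothetical. `io.open` is used because it accepts an `encoding` argument on both Python 2 and Python 3.

```python
import io
import os
import sqlite3
import tempfile

# Assumption: the file is Windows-1252. Replace with the real encoding once known.
ASSUMED_ENCODING = "cp1252"


def read_lines(path, encoding=ASSUMED_ENCODING):
    """Decode the file to Unicode text while reading; no bytes are ignored."""
    with io.open(path, "r", encoding=encoding) as handle:
        return [line.rstrip(u"\r\n") for line in handle]


def insert_lines(conn, lines):
    """Insert already-decoded Unicode strings using a parameterized query."""
    conn.execute("CREATE TABLE IF NOT EXISTS notes (body TEXT)")
    conn.executemany("INSERT INTO notes (body) VALUES (?)",
                     [(line,) for line in lines])
    conn.commit()


# Hypothetical sample file containing the problematic byte 0x92.
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as raw_file:
    raw_file.write(b"It\x92s here\n")

conn = sqlite3.connect(":memory:")
insert_lines(conn, read_lines(sample_path))
stored = conn.execute("SELECT body FROM notes").fetchone()[0]
os.remove(sample_path)
```

Re-saving the file as UTF-8 in an editor, as described in the last comment, reaches the same end state: the script then reads with `encoding='utf-8'` and no longer depends on the guessed source encoding.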
