
I am encoding strings into UTF-8 with p.encode('utf-8'). I am then trying to catch whatever might have gone wrong with:

import pandas as pd

def assert_encoding(s):
    # Missing values and non-strings are accepted; byte strings must decode as UTF-8.
    if s is None or pd.isnull(s) or not isinstance(s, basestring):
        return True
    try:
        s.decode('utf-8')
        return True
    except UnicodeError:
        return False
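
For illustration (these sample values are mine, not the actual failing data), the check behaves as I expect in Python 2:

assert assert_encoding(None)                    # missing values pass
assert assert_encoding('cil \xc3\xa0 cil')      # valid UTF-8 bytes pass
assert not assert_encoding('\xff')              # undecodable bytes fail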

A string goes through assert assert_encoding(s), but then an INSERT into my Postgres database (configured for UTF-8) fails with an error saying that 0xc3 0x20 is an invalid byte sequence for encoding UTF8.

  • Is there a loophole in assert_encoding?

1 Answer


I think I have found the cause.

Given:

s = 'cil à cil'.decode('latin-1')

which we then encode into utf-8:

'cil à cil'.decode('latin-1').encode('utf-8')
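
To see why character and byte counts diverge, note that à becomes the two bytes 0xC3 0xA0 in UTF-8, so the encoded form is one byte longer than the character count (the literal below uses an explicit latin-1 escape, and the variable names are only for illustration):

raw = 'cil \xe0 cil'            # 'cil à cil' as latin-1 bytes
text = raw.decode('latin-1')    # 9 characters
data = text.encode('utf-8')     # 10 bytes: the à expands to 0xC3 0xA0
print len(text), len(data)      # 9 10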

Some columns are too long, so I have to shorten them with something like:

 'cil à cil'.decode('latin-1').encode('utf-8')[0:x]

where x is the number of characters, or at least what I thought was the number of characters.

In reality, slicing operates on bytes, not characters, so a carelessly chosen x can cut the UTF-8 byte string in the middle of a multi-byte character. Slicing at byte 5, for example, leaves the lone lead byte 0xC3 dangling at the end ('cil \xc3'), which is no longer valid UTF-8:

'cil à cil'.decode('latin-1').encode('utf-8')[0:5].decode('utf-8')  # raises UnicodeDecodeError

And in my code, the encoding is checked only before the string is shortened, never after, so the broken truncation slips through.
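
One way out (a minimal sketch in Python 2; the helper name and the byte limit are mine, not from the original code) is to trim whole characters and only re-encode afterwards, so a multi-byte sequence is never split:

def truncate_utf8(raw, max_bytes, source_encoding='latin-1'):
    # Decode to unicode first, then drop trailing characters until the
    # UTF-8 re-encoding fits within max_bytes.
    text = raw.decode(source_encoding)
    encoded = text.encode('utf-8')
    while len(encoded) > max_bytes:
        text = text[:-1]
        encoded = text.encode('utf-8')
    return encoded

print repr(truncate_utf8('cil \xe0 cil', 5))   # 'cil ' -- the à is dropped whole, not split

An equivalent shortcut is to slice the bytes as before and then clean up with .decode('utf-8', 'ignore'), which silently discards the dangling partial character.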
