
I am encoding strings into UTF-8 with p.encode('utf-8'). I am then trying to catch whatever might have gone wrong with:

import pandas as pd

def assert_encoding(s):
    # Missing values and non-strings are accepted; byte strings must decode as UTF-8.
    if s is None or pd.isnull(s) or not isinstance(s, basestring):
        return True
    try:
        s.decode('utf-8')
        return True
    except UnicodeError:
        return False
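
For illustration (these sample values are mine, not the actual failing data), the check behaves as I expect in Python 2:

assert assert_encoding(None)                    # missing values pass
assert assert_encoding('cil \xc3\xa0 cil')      # valid UTF-8 bytes pass
assert not assert_encoding('\xff')              # undecodable bytes fail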

A string goes through assert assert_encoding(s), but then an INSERT into my Postgres database (configured for UTF-8) fails with an error saying that 0xc3 0x20 is an invalid byte sequence for encoding UTF8.

  • Is there a loophole in assert_encoding?

1 Answer


I think I have found the cause.

Given:

s = 'cil à cil'.decode('latin-1')

which we then encode into utf-8:

'cil à cil'.decode('latin-1').encode('utf-8')
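
To see why character and byte counts diverge, note that à becomes the two bytes 0xC3 0xA0 in UTF-8, so the encoded form is one byte longer than the character count (the literal below uses an explicit latin-1 escape, and the variable names are only for illustration):

raw = 'cil \xe0 cil'            # 'cil à cil' as latin-1 bytes
text = raw.decode('latin-1')    # 9 characters
data = text.encode('utf-8')     # 10 bytes: the à expands to 0xC3 0xA0
print len(text), len(data)      # 9 10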

Some columns are too long, so I have to shorten them with something like:

 'cil à cil'.decode('latin-1').encode('utf-8')[0:x]

where x is the number of characters, or at least what I thought was the number of characters.

In reality, slicing operates on bytes, not characters, so a carelessly chosen x can cut the UTF-8 byte string in the middle of a multi-byte character. Slicing at byte 5, for example, leaves the lone lead byte 0xC3 dangling at the end ('cil \xc3'), which is no longer valid UTF-8:

'cil à cil'.decode('latin-1').encode('utf-8')[0:5].decode('utf-8')  # raises UnicodeDecodeError

And in my code, the encoding is checked only before the string is shortened, never after, so the broken truncation slips through.
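
One way out (a minimal sketch in Python 2; the helper name and the byte limit are mine, not from the original code) is to trim whole characters and only re-encode afterwards, so a multi-byte sequence is never split:

def truncate_utf8(raw, max_bytes, source_encoding='latin-1'):
    # Decode to unicode first, then drop trailing characters until the
    # UTF-8 re-encoding fits within max_bytes.
    text = raw.decode(source_encoding)
    encoded = text.encode('utf-8')
    while len(encoded) > max_bytes:
        text = text[:-1]
        encoded = text.encode('utf-8')
    return encoded

print repr(truncate_utf8('cil \xe0 cil', 5))   # 'cil ' -- the à is dropped whole, not split

An equivalent shortcut is to slice the bytes as before and then clean up with .decode('utf-8', 'ignore'), which silently discards the dangling partial character.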
