I have lists of dicts in Python where almost every dict has at least one key whose string value contains a non-UTF-8 character. I want to keep them all as they are, insert them into my database, and get them back later through an API.

Here is what my list of dicts looks like:

items = [
    {
        "name": "World Bank (USA)",
        "shortName": "WB",
        "description": "<p><strong>WB - World Bank</strong> - is an international financial institution that provides loans to developing countries for capital programs. The World Bank's official goal is the reduction of poverty.</p><p> </p><p> </p>",
        "legalResidence": "USA",
    },.....]

As you can see in the description key, its value contains HTML tags inside the string, and it raises this error for me:

SyntaxError: Non-UTF-8 code starting with '\xa0' 

How can I ignore this error and let my string be as it is?

There is an existing question about this with a few answers, but all of them remove or replace these characters, which I don't want to do.

Talib Daryabi
  • "almost all my Dict objects has at least a key having string value containing Non-UTF-8 Character" - please explain. – Grismar May 26 '21 at 04:31
  • I mean that, like the first dictionary in my list, the other dict objects also have the same kind of characters in their strings – Talib Daryabi May 26 '21 at 04:40
  • UTF-8 is an encoding. That error was raised when you were trying to decode a bytes object. Were you reading a file? Getting a web page? The code where this error is hit is what we need to see. And the traceback message which will show us more information. You can "keep" the odd character by using bytes objects instead of strings, but likely the best thing is to figure out the correct encoding and using that instead of UTF-8. – tdelaney May 26 '21 at 04:41
  • @tdelaney figuring out the correct encoding is what I am trying to do – Talib Daryabi May 26 '21 at 04:43
  • So you're not going to tell us where you get this error? Kinda pointless question then. – tdelaney May 26 '21 at 04:44

1 Answer

The problem is that Python treats your source file as UTF-8 (which is the default), when in fact the file is not UTF-8. The byte 0xA0 is the "non-breaking space" in the default Windows-1252 character set. If that's where you got these strings, then you can try putting this comment at the top of your file:

# -*- coding: Windows-1252 -*-

and see if that lets things pass. The PROPER way to handle this is to convert those non-breaking spaces to regular spaces before putting them in your source code.
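
To see why the byte trips the parser, and how the character can be kept rather than replaced, here is a minimal sketch. It assumes the text really is Windows-1252 (the question does not confirm this), and the sample bytes are made up for illustration. Decoding with cp1252 keeps the non-breaking space as U+00A0, and re-encoding as UTF-8 stores it as the two bytes C2 A0.

```python
# Hypothetical sample: the bytes as they would sit in a Windows-1252 file.
raw = b"<p>\xa0</p>"

# A lone 0xA0 byte is not a valid UTF-8 sequence, so strict decoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the assumed real encoding keeps the character (U+00A0).
text = raw.decode("cp1252")

# Re-encoding as UTF-8 preserves the same character as two bytes.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

print(utf8_ok, repr(text), utf8_bytes)
```

If the same approach is applied to the whole source file (read it as cp1252, write it back as UTF-8), the coding comment should no longer be needed and the strings stay unchanged.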

Tim Roberts