
I'm exporting feeds from AlienVault OTX using staxii and trying to send them to MISP. When sending some feeds, the following error occurs:

UnicodeEncodeError: 'latin-1' codec can't encode character '\u2019' in position 3397: Body ('’') is not valid Latin-1. Use body.encode('utf-8') if you want to send it encoded in UTF-8.

for filename in os.listdir(dest_directory):
    filenameWithDir = dest_directory + filename
    try:
        file_index += 1
        print("****************")
        print(filenameWithDir)
        print(file_index)
        print("****************")
        misp_config.upload_stix(filenameWithDir, '1')
    except UnicodeEncodeError:
        # Replace characters that Latin-1 cannot encode, then retry the upload.
        with open(filenameWithDir, 'r') as file:
            filedata = file.read()
        filedata = filedata.replace('вЂ', ' ').replace('’', ' ').replace('“', ' ').replace('”', ' ')\
            .replace('–', ' ').replace('—', ' ').replace('™', ' ').replace('​', ' ').replace(' ', ' ')\
            .replace(' ', ' ').replace('…', ' ').replace(' ', ' ').replace('미북 정상회담 전망 및 대비', ' ')\
            .replace(',', ' ').replace('•', ' ').replace('‑', ' ')

        with open(filenameWithDir, 'w') as file:
            file.write(filedata)
        file_index += 1
        print("****************")
        print(filenameWithDir)
        print(file_index)
        print("****************")
        misp_config.upload_stix(filenameWithDir, '1')

I tried to replace characters that are not readable, but there are too many of them. Is it possible to delete characters by the position indicated in the error?

mdrnjss
  • does this answer your question: https://stackoverflow.com/questions/51157481/unicode-encode-error-latin-1-codec-cant-encode-character-u2019 – Raphael Nov 10 '20 at 13:27
    The error shows how to do it. Use `Object.encode('utf-8')` to encode it to `utf-8` – Barış Çiçek Nov 10 '20 at 13:27
  • Probably a duplicate of https://stackoverflow.com/questions/10611455/what-is-character-encoding-and-why-should-i-bother-with-it; see also https://meta.stackoverflow.com/questions/379403/problematic-questions-about-decoding-errors – tripleee Nov 10 '20 at 13:43

1 Answer


This is basically a Unicode problem that would happen in any Unicode-aware language. Fundamentals:

  • Unicode is a standard that aims to define a single well-known code point (and name) for every character in any known writing system.
  • An encoding is how Unicode code points ("characters") are stored and transmitted using one or more bytes.

There are encodings that make it possible to store any arbitrary Unicode code point (e.g. UTF-8, UTF-16) as well as encodings that permit only a subset of Unicode code points - e.g. the ISO 8859-1 (aka Latin-1) encoding, which supports only a small superset of ASCII.

Python translates between Unicode data (str) and byte data (bytes) using .encode (for str → bytes) and .decode (for bytes → str). Your code (or something that is called by your code) apparently uses .encode('latin-1'), but this encoding fails for the Right Single Quotation Mark \u2019, as Latin-1 does not support this character.
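A minimal sketch of that str/bytes round trip, reproducing the exact failure from the traceback:

```python
text = "It\u2019s"             # contains U+2019 RIGHT SINGLE QUOTATION MARK

# str -> bytes: UTF-8 can represent any Unicode code point.
data = text.encode("utf-8")
print(data)                     # b'It\xe2\x80\x99s'

# bytes -> str: decoding with the same codec recovers the original text.
assert data.decode("utf-8") == text

# Latin-1 has no mapping for U+2019, so this raises UnicodeEncodeError.
try:
    text.encode("latin-1")
except UnicodeEncodeError as err:
    print(err.reason)
```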

You can use another encoding to send that character. The UTF-8 encoding is a good choice, but your counterpart MUST be configured to use this encoding as well; otherwise you will get Mojibake, where the other side interprets your UTF-8 bytes as Latin-1 (or Windows-1252) and your character could show up as ’.
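You can reproduce that Mojibake directly - here the receiver is assumed to misread the UTF-8 bytes as Windows-1252:

```python
# U+2019 encodes to three bytes in UTF-8.
utf8_bytes = "\u2019".encode("utf-8")      # b'\xe2\x80\x99'

# A receiver that assumes Windows-1252 turns those three bytes
# into three unrelated characters.
wrong = utf8_bytes.decode("cp1252")
print(wrong)                                # ’
```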

If you are using Windows, it is likely that your source data was encoded in Windows-1252 rather than Latin-1 - the two are quite similar, but Windows-1252 does have a code for your Right Single Quotation Mark, so Windows-1252 could be the better choice of encoding.
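A quick check of that difference: Windows-1252 maps U+2019 to the single byte 0x92, while strict Latin-1 rejects it.

```python
# Windows-1252 reserves 0x92 for the Right Single Quotation Mark.
data = "\u2019".encode("cp1252")
print(data)                    # b'\x92'
assert data.decode("cp1252") == "\u2019"

# The same character fails under strict Latin-1, since 0x92 is a
# C1 control code there, not a printable quote.
try:
    "\u2019".encode("latin-1")
except UnicodeEncodeError:
    print("not representable in Latin-1")
```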

nd.
  • That last paragraph should probably be emphasized more. "Smart quotes" are one of the bigger (in the sense of most commonly seen) differences between latin-1 and cp1252 (a bunch of largely unused control codes in latin-1 are used for characters in cp1252). And cp1252 isn't purely a Windows thing either; "Windows-1252" is the *default* encoding for `text/` MIME types in HTML5 (yeah, I don't know why they did that), so after UTF-8, it's probably the single most commonly used character encoding on the web. The OP's source data being cp1252 is likely even if they're not personally on Windows. – ShadowRanger Nov 10 '20 at 14:39
  • Oh, and of course, the source data could just be regular data saved in any encoding that was produced by Microsoft products with smart-quotes enabled. Even if it got saved as UTF-8 or UTF-16 instead of cp1252, you'd expect to see smart quotes a lot. – ShadowRanger Nov 10 '20 at 14:41