Check if bytes result in valid ISO 8859-15 (Latin) in Python

Question

I want to test if a string of bytes that I'm extracting from a file results in valid ISO-8859-15 encoded text. The first thing I came across is this similar case about UTF-8 validation:

https://stackoverflow.com/a/5259160/1209004

So based on that, I thought I was being clever by doing something similar for ISO-8859-15. See the following demo code:

#! /usr/bin/env python
#

def isValidISO885915(bytes):
    # Test if bytes result in valid ISO-8859-15
    try:
        bytes.decode('iso-8859-15', 'strict')
        return(True)
    except UnicodeDecodeError:
        return(False)

def main():
    # Test bytes (byte x95 is not defined in ISO-8859-15!)
    bytes = b'\x4A\x70\x79\x6C\x79\x7A\x65\x72\x20\x64\x95\x6D\x6F\xFF'

    isValidLatin = isValidISO885915(bytes)
    print(isValidLatin)

main()

However, running this returns True, even though byte 0x95 is not defined in ISO-8859-15! Am I overlooking something really obvious here? (By the way, I tried this with Python 2.7.4 and 3.3; the results are identical in both cases.)

score 1 · Answer 1 · edited May 23 '17 at 12:15

I think I've found a workable solution myself, so I might as well share it.

Looking at the code page layout of ISO 8859-15, I really only need to check for the presence of code points 00-1f and 7f-9f. These correspond to the C0 and C1 control codes (plus DEL at 7f).
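To illustrate why the strict decode in the question never raises: Python's iso-8859-15 codec defines a mapping for every byte value, and bytes in the 80-9f range simply decode to the corresponding C1 control characters. The snippet below is a small sketch of that behaviour; the variable names are only for illustration.

```python
import unicodedata

# Decode every possible byte value; the codec has a mapping for all 256,
# so no UnicodeDecodeError is raised even in strict mode.
decoded = bytes(bytearray(range(256))).decode("iso-8859-15", "strict")

# Byte 0x95 decodes to the C1 control character U+0095, whose Unicode
# general category is "Cc" (control).
char_0x95 = b"\x95".decode("iso-8859-15")
category = unicodedata.category(char_0x95)

print(len(decoded), hex(ord(char_0x95)), category)
```

So a successful decode only says that the bytes can be mapped, not that they are printable text, which is why an extra check for control characters is needed.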

In my project I was already using something based on the code from https://stackoverflow.com/a/19016117/1209004 for removing control characters (C0 + C1) from a string. So, using that as a basis I came up with this:

#! /usr/bin/env python
#
import unicodedata

def removeControlCharacters(string):
    # Remove control characters from string
    # Based on: https://stackoverflow.com/a/19016117/1209004

    # Tab, newline and return are part of C0, but are allowed in XML
    allowedChars = [u'\t', u'\n', u'\r']
    return "".join(ch for ch in string if
        unicodedata.category(ch)[0] != "C" or ch in allowedChars)

def isValidISO885915(bytes):
    # Test if bytes result in valid ISO-8859-15

    # Decode bytes to string
    try:
        string = bytes.decode("iso-8859-15", "strict")
    except UnicodeDecodeError:
        # Bytes that cannot be decoded are not valid
        return False

    # Valid only if removing control characters leaves the
    # string unchanged
    return removeControlCharacters(string) == string

def main():
    # Test bytes (byte x95 is not defined in ISO-8859-15!)

    bytes = b'\x4A\x70\x79\x6C\x79\x7A\x65\x72\x20\x64\x95\x6D\x6F\xFF'

    print(isValidISO885915(bytes)) 


main()

There may be more elegant / Pythonic ways to do this, but it seems to do the trick, and works with both Python 2.7 and 3.3.
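One possible alternative, as a rough sketch rather than a drop-in replacement: since the check only concerns byte ranges 00-1f and 7f-9f, it can be done on the raw bytes without decoding at all. The names `is_printable_latin9`, `ALLOWED_CONTROLS` and `FORBIDDEN` are hypothetical and not part of the code above. Note one difference in behaviour: the `unicodedata` approach rejects every category starting with "C", which also covers the soft hyphen (byte ad, category Cf), whereas this sketch accepts it.

```python
# Hypothetical helper names; not part of the original answer.
# Tab, newline and carriage return are allowed, as in the code above.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}

# C0 controls (0x00-0x1F), DEL (0x7F) and C1 controls (0x80-0x9F),
# minus the allowed whitespace controls.
FORBIDDEN = (set(range(0x00, 0x20)) | set(range(0x7F, 0xA0))) - ALLOWED_CONTROLS


def is_printable_latin9(data):
    # bytearray yields integers on both Python 2.7 and Python 3
    return not any(b in FORBIDDEN for b in bytearray(data))


# Same test bytes as above: contains 0x95, so this prints False
print(is_printable_latin9(
    b'\x4A\x70\x79\x6C\x79\x7A\x65\x72\x20\x64\x95\x6D\x6F\xFF'))
```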

Just updated this so that tabs, newlines and carriage returns are excluded. — johan, Mar 04 '14 at 11:02
