24

As stated in the title, I would like to check whether a given file object (opened as a binary stream) is a valid UTF-8 file.

Anyone?

Thanks

malat
Jox

4 Answers

32
def try_utf8(data):
    """Return the decoded string on success, or None on failure."""
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return None

# f is the file object, already opened in binary mode ('rb')
data = f.read()
udata = try_utf8(data)
if udata is None:
    pass  # Not UTF-8. Do something else
else:
    pass  # Handle the decoded text
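
For large files, reading everything into memory at once may be undesirable. A minimal sketch of a chunked check follows, using the standard library's incremental UTF-8 decoder. The function name `is_valid_utf8` and the `chunk_size` parameter are illustrative assumptions, not part of the answer above.

```python
import codecs
import io


def is_valid_utf8(fileobj, chunk_size=64 * 1024):
    """Return True if the binary stream decodes as UTF-8 from its current position.

    Hypothetical helper: fileobj is assumed to be opened in binary mode.
    """
    decoder = codecs.getincrementaldecoder('utf-8')(errors='strict')
    try:
        # Read fixed-size chunks until read() returns empty bytes (end of stream).
        for chunk in iter(lambda: fileobj.read(chunk_size), b''):
            # The decoder buffers a multi-byte character split across chunks.
            decoder.decode(chunk)
        # final=True raises if the stream ended in the middle of a character.
        decoder.decode(b'', final=True)
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(io.BytesIO('héllo'.encode('utf-8'))))  # True
print(is_valid_utf8(io.BytesIO(b'\xff\xfe')))              # False
```

The incremental decoder matters here because a naive per-chunk `decode` would reject a valid file whenever a chunk boundary happens to fall inside a multi-byte character.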
Daniel Stutzbach
  • Obviously I didn't do my homework well enough, given that there is more than one solution as simple as this :( Thanks! – Jox Jul 16 '10 at 23:53
14

You could do something like

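import codecs

# filename is the path of the file to check
try:
    with codecs.open(filename, encoding='utf-8', errors='strict') as f:
        for line in f:
            pass  # iterating forces the whole file to be decoded
    print("Valid utf-8")
except UnicodeDecodeError:
    print("invalid utf-8")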
michael
  • Could be simpler by using only one line: `codecs.open("path/to/file", encoding="utf-8", errors="strict").readlines()` instead of 3. – colidyre May 07 '19 at 19:06
0

If anyone needs a script to find all non-UTF-8 files in the current directory:

import os

def try_utf8(data):
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return None


for root, _, files in os.walk('.'):
    if root.startswith('./.git'):
        continue
    for file in files:
        if file.endswith('.pyc'):
            continue
        path = os.path.join(root, file)
        with open(path, 'rb') as f:
            data = f.read()
            data = try_utf8(data)
            if data is None:
                print(path)
Vulwsztyn
0

In Python 3, you can do something like this:

with open(filename, 'rb') as f:
    try:
        f.read().decode('UTF-8')
        is_utf8 = True
    except UnicodeDecodeError:
        is_utf8 = False

print(is_utf8)
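
When the check fails, it can help to know where. The sketch below relies on the `start` attribute that `UnicodeDecodeError` carries; the function name `first_invalid_utf8_offset` is a hypothetical helper introduced for illustration.

```python
def first_invalid_utf8_offset(data):
    """Return the byte offset of the first invalid sequence, or None if data is valid UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as exc:
        # exc.start is the index in data where decoding failed.
        return exc.start
    return None


print(first_invalid_utf8_offset(b'abc\xffdef'))         # 3
print(first_invalid_utf8_offset('ok'.encode('utf-8')))  # None
```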
Flimm