
Background

I'm doing a job for someone that involved downloading ~123,000 US government court decisions stored as text files (.txt). Most of them seem to be encoded in Windows-1252, but some are apparently encoded in UCS-2 LE BOM (according to Notepad++). A few may use other encodings as well; I haven't figured out how to quickly get a complete list.

Problem

This variability in the encoding is preventing me from examining the UCS-2 files using Python.

I'd like a quick way to convert all of the files to UTF-8, regardless of their original encoding.

I have access to both a Linux and a Windows machine, so I can use solutions specific to either OS.

What I've tried

I tried using Python's cchardet library, but it doesn't seem to detect encodings as well as Notepad++ does: the library reports a certain file as Windows-1252 when Notepad++ says it's actually UCS-2 LE BOM.

import os

import cchardet


def print_the_encodings_used_by_all_files_in_a_directory():
    path_to_cases = '<fill this in>'
    encodings = set()
    detector = cchardet.UniversalDetector()

    for filename in os.listdir(path_to_cases):
        path_to_file = os.path.join(path_to_cases, filename)
        detector.reset()
        # Feed raw bytes line by line until the detector is confident.
        with open(path_to_file, 'rb') as infile:
            for line in infile:
                detector.feed(line)
                if detector.done:
                    break
        detector.close()

        encodings.add(detector.result['encoding'])
    print(encodings)


print_the_encodings_used_by_all_files_in_a_directory()

Here's what a hex editor shows as the first two bytes of the file in question: [screenshot of hex editor output]
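
If those leading bytes turn out to be a byte-order mark (BOM), one possible approach is to check for a BOM first and only fall back to a guessed encoding when none is present. Below is a minimal sketch of that idea using only the standard library. The function names `sniff_bom_encoding` and `convert_to_utf8` are hypothetical, and the Windows-1252 fallback is an assumption based on what most of the files appear to use; it is not a general-purpose detector.

```python
import codecs
from pathlib import Path

# BOM signatures mapped to codecs that strip the BOM while decoding.
# UTF-32 is checked before UTF-16 because the UTF-32 LE BOM
# (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def sniff_bom_encoding(raw, fallback='cp1252'):
    """Return a codec name based on the BOM, or `fallback` if there is none."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            return encoding
    return fallback


def convert_to_utf8(src, dst, fallback='cp1252'):
    """Decode `src` using its sniffed encoding and write it to `dst` as UTF-8."""
    raw = Path(src).read_bytes()
    encoding = sniff_bom_encoding(raw, fallback)
    # Strict decoding: a file that is not really in the assumed fallback
    # encoding may raise UnicodeDecodeError here instead of being
    # silently corrupted (cp1252 leaves a handful of byte values undefined).
    text = raw.decode(encoding)
    # Writing bytes avoids any newline translation by the platform.
    Path(dst).write_bytes(text.encode('utf-8'))
    return encoding
```

Writing to a separate destination path rather than overwriting the source keeps the original files intact in case the fallback guess is wrong for some of them.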

Nathan Wailes
  • You have a BOM of `FF FE`, which means UTF-16 LE. What is the problem? [Convert UTF-16 to UTF-8 and remove BOM?](https://stackoverflow.com/q/8827419/608639), [Converting UTF-16 to UTF-8](https://stackoverflow.com/q/31207287/608639), [Converting from utf-16 to utf-8 in Python3](https://stackoverflow.com/q/3140010/608639), [Convert UTF16LE file to UTF8 in Python?](https://stackoverflow.com/q/4633444/608639), etc. – jww Jul 08 '19 at 07:28
  • @jww The problem is that the cchardet library does not seem to be detecting it as UTF-16 LE. When I run the code you see in the question, it says that every file in the directory is using the 1252 encoding, with confidence "0.5". – Nathan Wailes Jul 08 '19 at 19:23
