Background
I'm doing a job for someone that involves downloading ~123,000 US government court decisions stored as text files (.txt). The files seem to be mostly encoded as Windows-1252, but some are apparently encoded as UCS-2 LE BOM (according to Notepad++). They may occasionally use other encodings as well; I haven't figured out how to quickly get a complete list.
Problem
This variability in the encoding is preventing me from examining the UCS-2 files using Python.
I'd like a quick way to convert all of the files to UTF-8, regardless of their original encoding.
I have access to both a Linux and a Windows machine, so I can use solutions specific to either OS.
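For concreteness, the kind of conversion I have in mind looks roughly like the sketch below. It assumes every file is either UTF-16/UCS-2 with a byte-order mark or Windows-1252 (which I haven't verified), and the directory names are placeholders:

import os

source_dir = 'cases_original'   # placeholder path
target_dir = 'cases_utf8'       # placeholder path
os.makedirs(target_dir, exist_ok=True)

for filename in os.listdir(source_dir):
    with open(os.path.join(source_dir, filename), 'rb') as infile:
        raw = infile.read()
    # Guess the encoding from the byte-order mark, otherwise assume Windows-1252.
    if raw.startswith(b'\xff\xfe') or raw.startswith(b'\xfe\xff'):
        text = raw.decode('utf-16')   # the utf-16 codec consumes the BOM
    else:
        text = raw.decode('cp1252')   # may raise UnicodeDecodeError if the file is something else
    with open(os.path.join(target_dir, filename), 'w', encoding='utf-8') as outfile:
        outfile.write(text)

This obviously breaks if a file uses some third encoding, which is why I'd like a more robust approach.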
What I've tried
I tried using Python's cchardet library, but it doesn't seem to detect encodings as reliably as Notepad++ does: for at least one file, cchardet reports Windows-1252 while Notepad++ says it's actually UCS-2 LE BOM.
import os
import cchardet

def print_the_encodings_used_by_all_files_in_a_directory():
    path_to_cases = '<fill this in>'
    encodings = set()
    detector = cchardet.UniversalDetector()
    for filename in os.listdir(path_to_cases):
        path_to_file = os.path.join(path_to_cases, filename)
        detector.reset()
        with open(path_to_file, 'rb') as infile:
            # Feed the file to the detector line by line until it's confident.
            for line in infile:
                detector.feed(line)
                if detector.done:
                    break
        detector.close()
        encodings.add(detector.result['encoding'])
    print(encodings)
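Since a UCS-2 LE BOM file should begin with the byte-order mark FF FE, I also looked at the raw bytes directly as a sanity check on cchardet's guess. A minimal sketch (the function name and path are placeholders of my own):

def first_two_bytes(path_to_file):
    # Read just the first two bytes so they can be compared against known BOMs.
    with open(path_to_file, 'rb') as infile:
        return infile.read(2)

print(first_two_bytes('<fill this in>'))  # b'\xff\xfe' would indicate UTF-16/UCS-2 LE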
Here's what a hex editor shows as the first two bytes of the file in question: