Though I understand that it is sometimes impossible to determine a file encoding, I'm trying here.

Bash

In bash, running file on it yields:
Non-ISO extended-ASCII text, with CRLF line terminators

vim

In Vim the Ex command :set fileencoding? yields:
fileencoding=latin1

If I open the file normally (see above) I get an <92> (hex 92); but if I open the file with :e ++enc=cp1252 I get ' (a right single quotation mark).
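A quick stdlib cross-check of that mapping (no Vim needed): byte 0x92 decodes to U+2019 under cp1252, but latin-1 maps it to the unprintable C1 control character U+0092, which is why Vim shows <92> when it reads the file as latin1.

```python
# Byte 0x92 under the two candidate encodings
raw = b"\x92"

print(raw.decode("cp1252"))             # U+2019, the curly apostrophe
print(hex(ord(raw.decode("latin-1"))))  # 0x92, a C1 control code
```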

I looked the character up on Wikipedia: it's part of code page 1252, and that page also states that cp1252 places its extra printable characters in the 0x80-0x9F range. So I turned on hlsearch to highlight the matches... and when I do the following searches:

/[^\x80-\x9F] appears to match all characters (I could be wrong about that) since /[\x80-\x9F] matches none!

So this file isn't encoded in cp1252, since all of its characters fall outside of that range.
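Before settling on that conclusion, a byte-level scan in Python sidesteps any questions about how Vim interprets the character class. This is a minimal sketch over a hypothetical stand-in for the log file's bytes; to check the real file, replace `sample` with `open(file_path, 'rb').read()`.

```python
# Hypothetical sample bytes standing in for the actual log file
sample = b"it\x92s a log line\r\n"

# Collect any raw bytes in the range cp1252 uses for its extra characters
suspect = sorted({b for b in sample if 0x80 <= b <= 0x9F})
print([hex(b) for b in suspect])  # non-empty means such bytes do exist
```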

Python using chardet and Unicode, Dammit!

chardet yields Windows-1252

{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}

I also tried bs4's UnicodeDammit class to figure it out, but it just returns None:

from pathlib import Path

import chardet
from bs4 import UnicodeDammit

# Resolve the log file's path relative to this script
mod_path = Path(__file__).parent
relPath = '../attempt1_no_extra_fields/again/logfile.txt'
file_path = (mod_path / relPath).resolve()

# Let chardet guess the encoding from the raw bytes
with open(file_path, 'rb') as dfe:
    detection = chardet.detect(dfe.read())

print('Chardet:', detection)

# Re-read the file, decoded with chardet's guess
with open(file_path, encoding=detection['encoding']) as non_unicode_file:
    data = non_unicode_file.read()

dammit = UnicodeDammit(data, ["iso-8859-1", "latin-1"])

print("dammit.original_encoding:", dammit.original_encoding)

gives:

`dammit.original_encoding: None`

I turned to Unicode, Dammit because it is said to give a better determination of the file encoding.
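One likely explanation for the None, assuming bs4's documented behavior: UnicodeDammit only runs detection when handed bytes, and the snippet above passes it `data`, a str that was already decoded with chardet's guess. For an already-decoded str, original_encoding is None by design. A minimal sketch with hypothetical sample bytes:

```python
from bs4 import UnicodeDammit

# Hypothetical stand-in for the log file's raw bytes
raw = b"it\x92s a log line\r\n"

# Passing a str: detection is skipped, so original_encoding is None
print(UnicodeDammit(raw.decode("cp1252")).original_encoding)

# Passing bytes: detection actually runs and reports an encoding
print(UnicodeDammit(raw, ["windows-1252", "iso-8859-1"]).original_encoding)
```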

leeand00