Though I understand that it is sometimes impossible to determine a file encoding, I'm trying here.
Bash
In bash, running `file` on the file yields:
Non-ISO extended-ASCII text, with CRLF line terminators
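The same check can be scripted; this is a minimal sketch, assuming GNU file is installed and that the file is logfile.txt in the current directory (the real path is resolved in the Python script further down):

import subprocess

# --mime-encoding asks GNU file for just its encoding guess.
result = subprocess.run(
    ['file', '--mime-encoding', 'logfile.txt'],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # e.g. 'logfile.txt: unknown-8bit' for this kind of file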
Vim
In Vim, the ex command :set fileencoding? yields:
fileencoding=latin1
If I open the file normally (see above), I get an <92> (hex 92); but if I open it with :e ++enc=cp1252, I get ’ instead.
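That display difference matches the encodings themselves: byte 0x92 maps to an invisible C1 control character in latin-1, but to the right single quotation mark in cp1252. A quick check with nothing but Python's built-in codecs:

print(b'\x92'.decode('latin-1'))  # U+0092, an invisible C1 control character
print(b'\x92'.decode('cp1252'))   # ’ (U+2019, RIGHT SINGLE QUOTATION MARK)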
I looked up ’ on Wikipedia: it's part of code page 1252, and that page also states that the characters specific to that code page sit in the 80-9F range. So I turned on hlsearch to highlight the matches, and when I do the following searches:
/[^\x80-\x9F]
appears to match every character (I could be wrong about that), since
/[\x80-\x9F]
matches none! So this file isn't encoded in cp1252, since all of its characters fall outside of that range.
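To rule out a quirk of Vim's regex or encoding handling, the raw bytes can be inspected directly; a minimal sketch, assuming the file is the logfile.txt resolved in the script further down:

from pathlib import Path

# Collect every distinct byte value in the cp1252-specific 0x80-0x9F range.
raw = Path('logfile.txt').read_bytes()
suspects = sorted({b for b in raw if 0x80 <= b <= 0x9F})
print([hex(b) for b in suspects])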
Python using chardet and Unicode, Dammit!
chardet yields Windows-1252:
{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
And I tried to use UnicodeDammit, which ships with bs4, to figure it out, but it just returns None:
from pathlib import Path

import chardet
from bs4 import UnicodeDammit

# `cwd`: current directory is straightforward
cwd = Path.cwd()

# Resolve the log file relative to this script's directory.
relPath = '../attempt1_no_extra_fields/again/logfile.txt'
mod_path = Path(__file__).parent
file_path = (mod_path / relPath).resolve()

# Let chardet guess the encoding from the raw bytes.
with open(file_path, 'rb') as dfe:
    detection = chardet.detect(dfe.read())
print('Chardet:', detection)

# Decode with chardet's guess, then hand the (already-decoded) str to UnicodeDammit.
with open(file_path, encoding=detection['encoding']) as non_unicode_file:
    data = non_unicode_file.read()
dammit = UnicodeDammit(data, ["iso-8859-1", "latin-1"])
print("dammit.original_encoding:", dammit.original_encoding)
gives:
`dammit.original_encoding: None`
I turned to Unicode, Dammit because it has been said to give a better determination of the file encoding.
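One detail that may matter: in the script above, UnicodeDammit receives data, a str that open() has already decoded, so there may be nothing left for it to detect; its documentation feeds it raw bytes. A sketch of that variant, under the same file-path assumption:

from pathlib import Path

from bs4 import UnicodeDammit

# Hand UnicodeDammit raw bytes so it actually has something to detect.
raw = Path('logfile.txt').read_bytes()
dammit = UnicodeDammit(raw, ["iso-8859-1", "latin-1"])
print("dammit.original_encoding:", dammit.original_encoding)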