I'm trying to use Chardet to deduce the encoding of a very large tab-delimited file (>4 million rows). At the moment my script struggles, presumably because of the file's size. I'd like to narrow it down to loading just the first x rows of the file, but I ran into difficulty when I tried to use readline().
The script as it stands is:
import chardet
import os
filepath = os.path.join(r"O:\Song Pop\01 Originals\2017\FreshPlanet_SongPop_0517.txt")
with open(filepath, 'rb') as f:  # 'rb' so chardet gets raw, undecoded bytes
    rawdata = f.readline()
print(rawdata)
result = chardet.detect(rawdata)
print(result)
It works, but it only reads the first line of the file. My foray into using a simple loop to call readline() more than once didn't go so well (perhaps because the script opens the file in binary mode).
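For reference, this is roughly what I was aiming for: read the first n lines in binary and concatenate the bytes before handing them to chardet. The read_first_lines helper and the throwaway sample file below are just for illustration, not my real data:

```python
import itertools
import os
import tempfile

def read_first_lines(path, n):
    """Return the raw bytes of the first n lines of the file at path."""
    # Binary mode: chardet wants undecoded bytes, not str.
    with open(path, 'rb') as f:
        return b''.join(itertools.islice(f, n))

# Throwaway sample file standing in for the real tab-delimited .txt.
with tempfile.NamedTemporaryFile(mode='wb', suffix='.txt', delete=False) as tmp:
    tmp.write('één\ttwee\tdrie\n'.encode('cp1252') * 5)
    sample_path = tmp.name

raw = read_first_lines(sample_path, 3)
print(raw.count(b'\n'))  # 3 -- bytes for exactly the first three lines
os.remove(sample_path)
# raw would then go to chardet.detect(raw)
```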
With just that first line, the output is {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}.
I was wondering whether increasing the number of lines it reads would improve the encoding confidence.
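I've also noticed that chardet ships an incremental UniversalDetector which, if I understand the docs correctly, can be fed line by line and stops early once it's confident, so the whole file never has to be in memory. Something like this sketch (the sample file is made up for illustration):

```python
import os
import tempfile
from chardet.universaldetector import UniversalDetector

# Illustrative sample file in place of the real multi-million-row .txt.
with tempfile.NamedTemporaryFile(mode='wb', suffix='.txt', delete=False) as tmp:
    tmp.write('Björk\tDebüt\t1993\n'.encode('cp1252') * 1000)
    sample_path = tmp.name

detector = UniversalDetector()
with open(sample_path, 'rb') as f:
    for line in f:
        detector.feed(line)       # feed one raw line at a time
        if detector.done:         # detector signals when it is confident
            break
detector.close()                  # finalise the guess
print(detector.result)            # dict with 'encoding' and 'confidence'
os.remove(sample_path)
```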
Any help would be greatly appreciated.