First, try opening the file with open and the encoding= named parameter, so that Python knows which encoding to decode. Without knowing the encoding (judging by the hex dump it looks like a Windows CP-1252 file, but I might be wrong) you're basically out of luck. On *nix or macOS you can use the file command, which tries to make an educated guess at the encoding.
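If you'd rather stay in Python, a crude fallback is to just try a few likely encodings until one decodes without errors. This is only a sketch: guess_encoding and CANDIDATES are names I made up, and note that 'latin-1' accepts any byte sequence, so it must come last. For a real heuristic, the third-party chardet package does this properly.

```python
# Try candidate encodings in order; 'latin-1' never fails, so keep it last.
CANDIDATES = ['utf-8', 'cp1252', 'latin-1']

def guess_encoding(path):
    for enc in CANDIDATES:
        try:
            with open(path, encoding=enc) as f:
                f.read()
            return enc  # first encoding that decodes the whole file cleanly
        except UnicodeDecodeError:
            continue
    return None
```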
Second, you should probably not read the whole file into a list with readlines()
. For really large files this is a memory hog. Better to stream through the file by iterating over it, as shown below.
MAXLINES = 100000

csvfile = open('./world_formatted.tsv', mode='r', encoding='utf-8')
# or encoding='cp1252' / 'latin-1'
filename = 0
outfile = None
for rownum, line in enumerate(csvfile):
    if rownum % MAXLINES == 0:
        if outfile is not None:
            outfile.close()  # close the previous chunk before starting a new one
        filename += 1
        outfile = open(str(filename) + '.tsv', mode='w', encoding='utf-8')
    outfile.write(line)
if outfile is not None:
    outfile.close()
csvfile.close()
I'm sure you would close the files anyway; I added the explicit close() calls just to be safe. :-)
If you are on a *nix'ish operating system (or macOS) you might want to check out the split
command (e.g. split -l 100000), which does exactly this (and more): How to split a large text file into smaller files with equal number of lines?