How to split a TSV file based on the number of rows

Question

I need to split a TSV file with 400,000 rows into 4 TSV files with 100,000 rows each.

My sample code:

csvfile = open('./world_formatted.tsv', 'r').readlines()
filename = 1
for i in range(len(csvfile)):
    if i % 100000 == 0:
        open(str(filename) + '.tsv', 'w+').writelines(csvfile[i:i+100000])
        filename += 1

I am getting this error:

'charmap' codec can't decode byte 0x8d in position 7316: character maps to <undefined>

Your file's encoding is something not supported by your current preferred encoding: https://stackoverflow.com/questions/9233027/unicodedecodeerror-charmap-codec-cant-decode-byte-x-in-position-y-character — Ilja Everilä, May 22 '18 at 09:45

score 1 · Answer 1 · answered May 22 '18 at 11:51

Try calling open() with the encoding= keyword argument, so that Python knows which encoding to use when reading the file.

Without knowing the encoding (it looks like a Windows CP-1252 file according to the hex code, but I might be wrong) you're basically out of luck. On *nix or macOS you can use the file command, which tries to make an educated guess at the encoding.
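If the file command is not available, a rough alternative is to try a few candidate encodings from Python and keep the first one that decodes the whole file without an error. This is only a sketch: the function name first_working_encoding and the candidate list are made up for illustration, and a successful decode does not prove the encoding is the right one. In particular latin-1 accepts every byte, so it only makes sense as the last fallback.

```python
def first_working_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate that decodes the whole file, or None.

    Hypothetical helper: the name and the default candidates are assumptions.
    """
    for enc in candidates:
        try:
            with open(path, mode="r", encoding=enc) as f:
                # Iterate line by line so the file is never held in memory at once
                for _ in f:
                    pass
        except UnicodeDecodeError:
            continue  # this encoding cannot decode the file, try the next one
        return enc
    return None

# Usage (the path is the one from the question):
# print(first_working_encoding('./world_formatted.tsv'))
```

Whatever this returns, it is worth opening one of the output files and checking that accented characters look right.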

Second, you should probably not read everything into a list with readlines(). For really large files this is a memory hog. It is better to stream through the file by iterating over it, as shown below.

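MAXLINES = 100000

# encoding could also be 'latin-1' or 'cp1252', depending on the file
with open('./world_formatted.tsv', mode='r', encoding='utf-8') as csvfile:
    outfile = None
    filename = 0
    for rownum, line in enumerate(csvfile):
        if rownum % MAXLINES == 0:
            # Start a new output file, closing the previous one first
            if outfile is not None:
                outfile.close()
            filename += 1
            outfile = open(str(filename) + '.tsv', mode='w', encoding='utf-8')
        outfile.write(line)
    # Close the last output file (if the input was not empty)
    if outfile is not None:
        outfile.close()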

I'm sure you close the files after running; I just added the close() calls to be safe. :-)
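The same streaming idea can also be wrapped in a function that uses with blocks, so every file is closed even if an error occurs halfway. This is a minimal sketch, not the only way to do it: the name split_by_rows and its parameters (max_lines, out_dir) are my own invention, and it holds at most max_lines lines in memory at a time.

```python
from itertools import islice
from pathlib import Path

def split_by_rows(src, max_lines=100000, encoding="utf-8", out_dir="."):
    """Split src into 1.tsv, 2.tsv, ... with at most max_lines lines each.

    Hypothetical helper: name and parameters are assumptions for illustration.
    """
    out_paths = []
    # newline="" keeps the original line endings untouched on read and write
    with open(src, mode="r", encoding=encoding, newline="") as infile:
        part = 0
        while True:
            # Pull the next block of lines without reading the whole file
            chunk = list(islice(infile, max_lines))
            if not chunk:
                break
            part += 1
            out_path = Path(out_dir) / f"{part}.tsv"
            with open(out_path, mode="w", encoding=encoding, newline="") as outfile:
                outfile.writelines(chunk)
            out_paths.append(out_path)
    return out_paths

# Usage (path from the question; the encoding is a guess, adjust as needed):
# split_by_rows('./world_formatted.tsv', max_lines=100000, encoding='latin-1')
```

Note that this does not repeat a header row in each output file; if the TSV has one, it ends up only in the first part.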

If you are on a Unix-like operating system (or macOS) you might want to check out the split command, which does exactly this (and more): "How to split a large text file into smaller files with equal number of lines?"

Find a list of the supported standard encodings here: https://docs.python.org/3/library/codecs.html#standard-encodings — Arminius, May 22 '18 at 11:55

score 0 · Accepted Answer · answered May 24 '18 at 09:50

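with open('./formatted.tsv', 'r', encoding="ISO-8859-1") as infile:
    csvfile = infile.readlines()

filename = 1
# Step through the rows in chunks of 100000 and write each chunk to its own file
for i in range(0, len(csvfile), 100000):
    with open(str(filename) + '.tsv', 'w', encoding="ISO-8859-1") as outfile:
        outfile.writelines(csvfile[i:i+100000])
    filename += 1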

This is what solved the problem for me. Thank you all for the help.
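A note on why ISO-8859-1 gets rid of the error: it maps each of the 256 byte values to exactly one character, so decoding can never fail, and writing the text back with the same encoding reproduces the original bytes. That does not mean the file really is ISO-8859-1, only that the split files carry the same bytes as the input. The snippet below is a small illustration of that property, not part of the solution itself.

```python
raw = bytes(range(256))  # every possible byte value, including 0x8d

text = raw.decode("iso-8859-1")            # never raises for any input
round_tripped = text.encode("iso-8859-1")  # back to bytes
print(round_tripped == raw)                # True

# cp1252 (a common Windows default) leaves some byte values undefined,
# which is what produces the 'charmap' error from the question
try:
    raw.decode("cp1252")
except UnicodeDecodeError as exc:
    print("cp1252 failed:", exc.reason)
```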
