0

I need to split a TSV file with 400000 rows into 4 CSV files of 100000 rows each.

My sample code:

csvfile = open('./world_formatted.tsv', 'r').readlines()
filename = 1
for i in range(len(csvfile)):
    if i % 100000 == 0:
        open(str(filename) + '.tsv', 'w+').writelines(csvfile[i:i+100000])
        filename += 1

I am getting this error:

'charmap' codec can't decode byte 0x8d in position 7316: character maps to <undefined>
Ralph519
Code explore

2 Answers

1

Try passing the encoding= keyword argument to open(), so that Python knows which encoding to use when decoding the file.

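Without knowing the file's actual encoding you're basically out of luck. Judging by the error, Python fell back to a Windows code page such as CP-1252, in which byte 0x8d has no character assigned, so the file is probably in some other encoding (but I might be wrong). On *nix or macOS you can use the file command, which tries to make an educated guess at the encoding.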
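
If the file command is not available (on Windows, for example), a rough alternative is to try a few candidate encodings in Python and see which ones decode the file without an error. This is only a sketch: the helper name find_decodable_encodings and the candidate list are my own assumptions, and a clean decode does not prove the encoding is the right one (Latin-1, for instance, accepts every possible byte).

```python
def find_decodable_encodings(path, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Return the candidate encodings that decode the whole file without error.

    Hypothetical helper for illustration; the candidate list is an assumption.
    """
    usable = []
    for name in candidates:
        try:
            # Stream through the file so it is never held in memory at once
            with open(path, mode='r', encoding=name) as f:
                for _ in f:
                    pass
        except UnicodeDecodeError:
            continue  # this encoding cannot represent some byte in the file
        usable.append(name)
    return usable
```

Calling find_decodable_encodings('./world_formatted.tsv') gives a short list to check by eye; the right choice is the one where the non-ASCII characters look sensible when printed.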

Second, you should probably not read the whole file into a list with readlines(). For really large files this is a memory hog. It is better to stream through the file by iterating over it line by line, as shown below.

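MAXLINES = 100000

# Use the encoding that matches the file, e.g. 'utf-8', 'latin-1' or 'cp1252'
csvfile = open('./world_formatted.tsv', mode='r', encoding='utf-8')
filename = 0
outfile = None
for rownum, line in enumerate(csvfile):
    if rownum % MAXLINES == 0:
        # Start a new output file, closing the previous one first
        if outfile is not None:
            outfile.close()
        filename += 1
        outfile = open(str(filename) + '.tsv', mode='w', encoding='utf-8')
    outfile.write(line)
if outfile is not None:
    outfile.close()
csvfile.close()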

I'm sure you close the files after running, just added it to be sure. :-)

If you are on a *nix-like operating system (or macOS) you might want to check out the split command, which does exactly this (and more): How to split a large text file into smaller files with equal number of lines?
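
For comparison, here is a minimal Python sketch of the same idea as split: it works on raw bytes, so it never decodes the file and the 'charmap' error cannot occur. The function name split_by_lines and the out_dir parameter are hypothetical, and it assumes an ASCII-compatible encoding in which lines end with a \n byte (so not UTF-16).

```python
import os

def split_by_lines(path, lines_per_file=100000, out_dir='.'):
    """Split a file into 1.tsv, 2.tsv, ... without decoding its contents.

    Returns the number of files written. Names are illustrative only.
    """
    outfile = None
    count = 0
    with open(path, 'rb') as infile:  # binary mode: bytes in, bytes out
        for rownum, line in enumerate(infile):
            if rownum % lines_per_file == 0:
                # Close the finished chunk before starting the next one
                if outfile is not None:
                    outfile.close()
                count += 1
                outfile = open(os.path.join(out_dir, str(count) + '.tsv'), 'wb')
            outfile.write(line)
    if outfile is not None:
        outfile.close()
    return count
```

Because the bytes are copied unchanged, the output files keep whatever encoding the input had, which may or may not be what you want downstream.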

Arminius
  • Find a list of the supported standard encodings here: https://docs.python.org/3/library/codecs.html#standard-encodings – Arminius May 22 '18 at 11:55
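
To check whether a particular spelling of an encoding name is accepted before passing it to open(), something along these lines might help. The helper name is_known_encoding is made up for this example; codecs.lookup is the standard-library call that resolves a name or alias to a codec.

```python
import codecs

def is_known_encoding(name):
    """Return True if Python can resolve `name` to a registered codec."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False  # unknown name or alias
    return True
```

For example, is_known_encoding('ISO-8859-1') and is_known_encoding('latin-1') are both true, since they are aliases for the same codec.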

0

csvfile = open('./formatted.tsv', 'r', encoding="ISO-8859-1").readlines()

filename = 1
for i in range(len(csvfile)):
    if i % 100000 == 0:
        open(str(filename) + '.tsv', 'w+', encoding="ISO-8859-1").writelines(csvfile[i:i+100000])
        filename += 1

This is what solved the problem for me. Thank you all for the help.

Code explore