I am trying to split a large CSV file into multiple files using the code snippet below. I am using Python 3.7.7 on Windows. I added utf8 encoding when opening the output files, but it still fails with a UnicodeDecodeError. Do you know why?

Here is my code:

import os
def split(filehandler, delimiter=',', row_limit=125000, output_name_template='jokes_%s.csv', output_path='.', keep_headers=True):
    """
    Splits a CSV file into multiple pieces.

    A quick bastardization of the Python CSV library.
    Arguments:
        `row_limit`: The number of rows you want in each output file. 125,000 by default.
        `output_name_template`: A %s-style template for the numbered output files.
        `output_path`: Where to stick the output files.
        `keep_headers`: Whether or not to print the headers in each output file.
    Example usage:

        >> from toolbox import csv_splitter;
        >> csv_splitter.split(open('/home/ben/input.csv', 'r'));

    """
    import csv
    reader = csv.reader(filehandler,  delimiter=delimiter)
    current_piece = 1
    current_out_path = os.path.join(
         output_path,
         output_name_template  % current_piece
    )
    print(current_out_path)
    current_out_writer = csv.writer(open(current_out_path, 'w', encoding='utf8', newline=''), delimiter=delimiter)
    current_limit = row_limit
    if keep_headers:
        headers = next(reader)
        current_out_writer.writerow(headers)
    for i, row in enumerate(reader):
        if i + 1 > current_limit:
            current_piece += 1
            current_limit = row_limit * current_piece
            current_out_path = os.path.join(
               output_path,
               output_name_template  % current_piece
            )
            print(current_out_path)
            current_out_writer = csv.writer(open(current_out_path, 'w', encoding='utf8', newline=''), delimiter=delimiter)
            if keep_headers:
                current_out_writer.writerow(headers)
        current_out_writer.writerow(row)

split(open('jokes.csv', 'r'))

And this is the error message:

  File "csv_cutter.py", line 47, in <module>
    split(open('jokes.csv', 'r'))
  File "csv_cutter.py", line 33, in split
    for i, row in enumerate(reader):
  File "C:\Program Files\Python37\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6409: character maps to <undefined>
AndrejCoding
  • You can try this one [charmap codec can't decode](https://stackoverflow.com/questions/9233027/unicodedecodeerror-charmap-codec-cant-decode-byte-x-in-position-y-character) – ksohan May 14 '20 at 18:22
  • The traceback shows that the error is on this line `split(open('jokes.csv', 'r'))` - you need to set the encoding here as well. – snakecharmerb May 14 '20 at 18:24

1 Answer

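The output files are opened with encoding='utf8', but the input file is not. Without an explicit encoding, Python on Windows falls back to the locale default (cp1252 here, as the traceback shows), and byte 0x9d has no mapping in cp1252. Assuming jokes.csv is actually UTF-8, you can change `split(open('jokes.csv', 'r'))` to `split(open('jokes.csv', 'r', encoding="utf8"))` and give it a try.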
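Here is a minimal sketch of why that byte trips up cp1252, assuming the file is UTF-8 and contains a curly quote. The sample data and the file name `jokes_sample.csv` are made up for illustration. U+201D (right double quotation mark) is encoded in UTF-8 as the bytes E2 80 9D, so the 0x9d from the error message could plausibly come from a character like this.

```python
import csv
import os
import tempfile

# Hypothetical sample data: U+201D is encoded in UTF-8 as E2 80 9D,
# and byte 0x9D has no mapping in cp1252.
sample = 'id,joke\n1,He said \u201chello\u201d\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'jokes_sample.csv')
    with open(path, 'w', encoding='utf8', newline='') as f:
        f.write(sample)

    # Reading with cp1252 (the codec named in the traceback) fails.
    try:
        with open(path, 'r', encoding='cp1252', newline='') as f:
            list(csv.reader(f))
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

    # Reading with an explicit UTF-8 encoding succeeds.
    with open(path, 'r', encoding='utf8', newline='') as f:
        rows = list(csv.reader(f))
```

Opening the input with `newline=''` is what the csv module documentation recommends for files passed to `csv.reader`. If the file turns out not to be valid UTF-8 either, you would need to find out its real encoding; `open()` also accepts an `errors` argument (for example `errors='replace'`), at the cost of losing the undecodable characters.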

ksohan