
I have about 240 CSV files, and some are pretty large. They all contain similar information, and basically I'm just trying to merge them all together, but I'm getting encoding errors / errors reading the file contents.

Here is the error message that I get:

Traceback (most recent call last):
  File "C:\Users\ethan\Desktop\filemerger.py", line 23, in <module>
    for row in reader:
_csv.Error: line contains NUL

Process finished with exit code 1

Here is my code, any help would be much appreciated:

import os
import csv
import chardet

directory_path = r"A:\FilesMerge"

header_dict = {}

data_rows = []

for filename in os.listdir(directory_path):
    if filename.endswith(".csv"):
        file_path = os.path.join(directory_path, filename)
        with open(file_path, 'r', errors="ignore") as csvfile:
            reader = csv.reader(csvfile)
            headers = next(reader)
            for header in headers:
                if header not in header_dict.keys():
                    header_dict[header] = len(header_dict)
                else:
                    header_dict[header] = min(header_dict[header], len(header_dict)-1)
            data_rows.extend([[],[],[],[filename]])
            for row in reader:
                if len(row) == len(headers):
                    data_rows.append(row)
                else:
                    new_row = [''] * len(headers)
                    new_row[:len(row)] = row
                    data_rows.append(new_row)

sorted_headers = sorted(header_dict.keys(), key=lambda x: header_dict[x])

with open(os.path.join(directory_path, "merged_headers_and_data.csv"), 'rb') as f:
    result = chardet.detect(f.read())


with open(os.path.join(directory_path, "merged_headers_and_data.csv"), 'w', newline='', encoding=result['encoding'], errors='ignore') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(sorted_headers)
    writer.writerows(data_rows)
Does this answer your question? ["Line contains NULL byte" in CSV reader (Python)](https://stackoverflow.com/questions/7894856/line-contains-null-byte-in-csv-reader-python) – pippo1980 May 01 '23 at 17:41

1 Answer


This error was removed from the csv module in Python 3.11.0+. You can see the long discussion as to why in Issue 27580; the patch attached there is what removed that specific error message.

If you can, upgrade to Python 3.11.0+ and move on: you can leave the files as they are and let the decoder just do its thang.
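
For instance, a minimal guard at the top of the merge script makes that requirement explicit (a sketch; the version check is my addition, not something from the original script):

import sys

# csv only tolerates NUL bytes from Python 3.11 onward, so fail fast
# on older interpreters instead of partway through the merge.
if sys.version_info < (3, 11):
    sys.exit("Python 3.11+ required: older csv modules raise 'line contains NUL'")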

If you cannot upgrade your Python, I believe your only path forward will be to clean the extraneous NULL bytes.

Find and clean NULL bytes in your CSVs

If I were in your shoes, I think I'd first try to characterize the nature of the problem: how many files with how many lines each have the NULL byte, and while at it, are they all UTF-8?

I mocked up a few CSVs in the directory CSV_files:

input1.csv
input2.csv
input3.csv
input4.csv

But before doing anything else, I'd copy the entire directory to a "working" directory and work only in the copy, leaving the originals to recover from any mistakes.
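
A sketch of that copy step, using the directory names from this answer (note that shutil.copytree requires that the destination not already exist):

import shutil

# Copy the originals into a working directory; every step below edits
# only the copy, so the source files stay recoverable.
shutil.copytree("CSV_files", "CSV_files_working")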

The following script tries to read each CSV file in that directory purely as UTF-8 text and reports some NULL-byte stats and any errors. It uses the csv module, but only for writing the report; it never tries to parse the input files as CSV:

import csv
import glob
import sys

NULL_BYTE = "\x00"

writer = csv.writer(sys.stdout)
writer.writerow(["File", "Lines w/NULL", "NULL count", "Error"])

filenames = glob.glob(r"CSV_files_working/*.csv")

for fname in sorted(filenames):
    line_ct = 0   # lines containing at least one NULL byte
    nulls_ct = 0  # total NULL bytes in the file
    error = ""

    try:
        with open(fname, encoding="utf-8") as f_in:
            for line in f_in:
                if NULL_BYTE in line:
                    line_ct += 1
                    nulls_ct += line.count(NULL_BYTE)

    except Exception as e:
        error = str(e)

    writer.writerow([fname, line_ct, nulls_ct, error])

I'd take that output CSV report into a spreadsheet and see what I was dealing with:

| File                         | Lines w/NULL | NULL count | Error |
|------------------------------|--------------|------------|-------|
| CSV_files_working/input1.csv | 0            | 0          |       |
| CSV_files_working/input2.csv | 3            | 6          |       |
| CSV_files_working/input3.csv | 3            | 3          |       |
| CSV_files_working/input4.csv | 0            | 0          | 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte |

I'd then open any non-UTF-8 files in a text editor or a spreadsheet and save them as UTF-8. When I opened input4.csv in VSCode, it (correctly) guessed the encoding was UTF-16, and I could save the file as UTF-8 right there in VSCode. If I had a lot of files and could guess their current encodings, I'd run a script to convert the non-UTF-8 files to UTF-8:

enc_filenames = [
    ("utf-16", "CSV_files_working/input4.csv"),
]

for input_enc, fname in enc_filenames:
    # Read the whole file with its current encoding...
    with open(fname, encoding=input_enc) as f_in:
        data = f_in.read()

    # ...then write it back in place as UTF-8
    with open(fname, "w", encoding="utf-8") as f_out:
        f_out.write(data)
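
If hand-identifying encodings isn't practical, the chardet module the question already imports can produce a per-file guess. A minimal sketch (chardet's detection is heuristic, so spot-check its guesses before converting anything):

import glob

import chardet

for fname in sorted(glob.glob("CSV_files_working/*.csv")):
    # A prefix of the file is usually enough for a detection guess
    with open(fname, "rb") as f:
        guess = chardet.detect(f.read(100_000))
    print(fname, guess["encoding"], guess["confidence"])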

Run the report again, python3 find-NULs.py > report.csv:

| File                         | Lines w/NULL | NULL count | Error |
|------------------------------|--------------|------------|-------|
| CSV_files_working/input1.csv | 0            | 0          |       |
| CSV_files_working/input2.csv | 3            | 6          |       |
| CSV_files_working/input3.csv | 3            | 3          |       |
| CSV_files_working/input4.csv | 1            | 1          |       |

I'd then probably want to see exactly what the NULLs looked like... should they be converted to some other character, or can they just be removed?

NULL_BYTE = "\x00"

filenames = [
    "CSV_files_working/input2.csv",
    "CSV_files_working/input3.csv",
    "CSV_files_working/input4.csv",
]

for fname in filenames:
    with open(fname, encoding="utf-8") as f_in:
        print(f"{fname}:")
        for i, line in enumerate(f_in, start=1):
            if NULL_BYTE in line:
                print(f"  line {i:>04}: {repr(line)}")
CSV_files_working/input2.csv:
  line 0002: 'f2r1c1\x00,f2r1c2\x00\n'
  line 0003: 'f2r2c1\x00,f2r2c2\x00\n'
  line 0004: 'f2r3c1\x00,f2r3c2\x00\n'
CSV_files_working/input3.csv:
  line 0002: 'f3r1c1,f3r1c2\x00\n'
  line 0003: 'f3r2c1,f3r2c2\x00\n'
  line 0004: 'f3r3c1,f3r3c2\x00\n'
CSV_files_working/input4.csv:
  line 0004: 'f4r3c1,f4r3c2\x00\n'

All the NULLs in my data come at the end of a field (not in the middle of a field's data), so I can safely just remove them:

NULL_BYTE = "\x00"

filenames = [
    "CSV_files_working/input2.csv",
    "CSV_files_working/input3.csv",
    "CSV_files_working/input4.csv",
]

for fname in filenames:
    with open(fname, encoding="utf-8") as f_in:
        data = [line.replace(NULL_BYTE, "") for line in f_in]

    with open(fname, "w", encoding="utf-8") as f_out:
        f_out.writelines(data)

    print(f"cleaned {fname}")
cleaned CSV_files_working/input2.csv
cleaned CSV_files_working/input3.csv
cleaned CSV_files_working/input4.csv

Run the report again, and all the files are clean:

| File                         | Lines w/NULL | NULL count | Error |
|------------------------------|--------------|------------|-------|
| CSV_files_working/input1.csv | 0            | 0          |       |
| CSV_files_working/input2.csv | 0            | 0          |       |
| CSV_files_working/input3.csv | 0            | 0          |       |
| CSV_files_working/input4.csv | 0            | 0          |       |

Your data might not allow just removing the NULLs. You might need to replace each with a space (line.replace(NULL_BYTE, " ")), then deal with the extra spaces once you can parse the CSV file.
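
As a sketch of that variant (the per-field strip is my assumption about what cleaning up the extra spaces could look like), csv.reader happily consumes a generator that swaps NULLs for spaces on the way in:

import csv

NULL_BYTE = "\x00"

with open("CSV_files_working/input2.csv", encoding="utf-8", newline="") as f_in:
    # Replace NULLs with spaces while reading, then trim the padding
    # that replacement leaves behind in each field.
    reader = csv.reader(line.replace(NULL_BYTE, " ") for line in f_in)
    rows = [[field.strip() for field in row] for row in reader]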

Consolidate CSVs

Finally, I'd like to recommend a different approach to the bigger task, which is consolidating different CSVs with different headers. I think you can do all of what you want by using the features that DictReader and especially DictWriter already have, rather than managing the differences in headers yourself.

My mock CSV files have the following headers:

input1.csv
Col__1,Col__2
input2.csv
Col__3,Col__2
input3.csv
Col__4,Col__7
input4.csv
Col__1,Col9

I want to merge those four files and expect the final header to be:

Col9,Col__1,Col__2,Col__3,Col__4,Col__7

I can get this final output by giving DictWriter all the fieldnames and telling it to insert an empty string ("") for every field that a given row doesn't have, via restval="":

import csv
import glob

filenames = glob.glob(r"CSV_files_working/*.csv")

fieldnames: set[str] = set()
all_rows: list[dict[str, str]] = []

for fname in sorted(filenames):
    with open(fname, encoding="utf-8", newline="") as f_in:
        reader = csv.DictReader(f_in)

        if reader.fieldnames is None:
            # An empty file has no header row to read
            print(f"{fname} doesn't have fieldnames!")
            raise SystemExit(1)

        fieldnames.update(reader.fieldnames)

        all_rows.extend(reader)

# Sort first by len of the fieldname, then sort lexically
final_fieldnames = sorted(fieldnames, key=lambda x: (len(x), x))

with open("output.csv", "w", encoding="utf-8", newline="") as f_out:
    writer = csv.DictWriter(f_out, fieldnames=final_fieldnames, restval="")
    writer.writeheader()
    writer.writerows(all_rows)

and I get:

| Col9   | Col__1 | Col__2 | Col__3 | Col__4 | Col__7 |
|--------|--------|--------|--------|--------|--------|
|        | f1r1c1 | f1r1c2 |        |        |        |
|        | f1r2c1 | f1r2c2 |        |        |        |
|        | f1r3c1 | f1r3c2 |        |        |        |
|        |        | f2r1c2 | f2r1c1 |        |        |
|        |        | f2r2c2 | f2r2c1 |        |        |
|        |        | f2r3c2 | f2r3c1 |        |        |
|        |        |        |        | f3r1c1 | f3r1c2 |
|        |        |        |        | f3r2c1 | f3r2c2 |
|        |        |        |        | f3r3c1 | f3r3c2 |
| f4r1c2 | f4r1c1 |        |        |        |        |
| f4r2c2 | f4r2c1 |        |        |        |        |
| f4r3c2 | f4r3c1 |        |        |        |        |
– Zach Young