
Background

OS: Red Hat Linux; Python version: 3.6

I have several thousand files I need to load into HDFS on a weekly basis to feed some Impala tables. Some of the fields have interstitial commas, like so:

"ingredient","amount"
"CORN, GROUND",1

The fields are quoted, but Impala ignores the quoting and reads those interstitial commas as field separators, so I'm trying to use Python to strip them out ahead of time. When the files are exported from the source system (possibly SQL Server, but I don't know for sure), NULL field values come out as literal NUL bytes, like so:

[Screenshot: NUL bytes scattered through one of the exported files]

Problem

When the code runs into a file with NUL bytes, it throws an error (the "line contains NULL byte" error from the csv module). I thought errors='ignore' would keep the reader from choking on the NUL bytes, but it only suppresses decoding errors, and a NUL byte decodes cleanly in most encodings, so it reaches csv.reader untouched. I have also tried using the chardet library to detect the encoding ahead of time and open the file accordingly. No dice: although it turns out that some of the files DO have different encodings, tailoring the open() call doesn't seem to help.
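
For reference, here is a minimal sketch of the direction I'm considering with just the csv builtin: csv.reader accepts any iterable of strings, so a generator could strip the NUL characters before the reader sees each line (file_path, file, and encoding are the same names used in the script below). Would something like this be the right approach?

import csv

with open(file_path+file, 'r', newline='', encoding=encoding, errors='ignore') as f:
  # Strip '\0' from each line on the fly; csv.reader only needs an
  # iterable of strings, not a file object.
  reader = csv.reader((line.replace('\0', '') for line in f), delimiter=',', quotechar='"')
  for row in reader:
    print(row)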

Code

## This is in a separate utility module from the script below
import csv

class Util:  # class wrapper assumed; the original post only shows the method
  def hasInFieldDelimiters(self, file_obj_to_parse, quote_char='"', field_delimiter=',', delimiter_to_find=','):
    # Scan every parsed field and stop at the first in-field delimiter.
    print("Finding commas in file {}".format(file_obj_to_parse))
    row_reader = csv.reader(file_obj_to_parse, delimiter=field_delimiter, quotechar=quote_char)
    for row in row_reader:
      for field in row:
        if delimiter_to_find in field:
          print("In-field comma found")
          return True
    print("No commas found")
    return False
## This is the script I run.  The method above is called below as "util.hasInFieldDelimiters()"
import csv
import os
# file_path and util (an instance of the utility class above) are set up earlier; omitted here.

for file in os.listdir(file_path):
  encoding = util.getEncoding(file_path, file)
  with open(file_path+file, 'r', newline='', encoding=encoding, errors='ignore') as file_obj_to_parse:
    has_in_field_delimiters = util.hasInFieldDelimiters(file_obj_to_parse)
  if has_in_field_delimiters:
    to_write = []
    with open(file_path+file, 'r', newline='', encoding=encoding, errors='ignore') as file_obj_to_parse:
      reader = csv.reader(file_obj_to_parse, delimiter=',', quotechar='"')
      for row in reader:
        # Build a new row of cleaned fields; rebinding the loop variable
        # ("field = field.replace(...)") would not modify the row.
        to_write.append([field.replace(',', '') for field in row])
    # newline='' is required when handing a file object to csv.writer.
    with open(file_path+file, 'w', newline='', encoding='latin-1') as replacement:
      writer = csv.writer(replacement)
      writer.writerows(to_write)

What Else Have I Tried?

I have tried several different ways of calling both open() and csv.reader(). I cannot install third-party modules like xlrd on this shared remote system and have to make do with what ships with Python, like the csv and codecs builtins.
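
For completeness, here is a minimal sketch of how the NUL bytes could at least be detected with only builtins, by reading raw bytes before any decoding happens (the has_nul_bytes helper is a name I made up for this sketch):

def has_nul_bytes(path, chunk_size=1024*1024):
  # Read in binary mode so no codec is involved; a NUL is b'\x00'
  # regardless of what text encoding the file turns out to use.
  with open(path, 'rb') as f:
    while True:
      chunk = f.read(chunk_size)
      if not chunk:
        return False
      if b'\x00' in chunk:
        return True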

Python CSV error: line contains NULL byte

python Dictread of CSV file with NUL bytes in data

File not opening past NUL byte

Python - Finding unicode/ascii problems

Drew R
  • Q: Any chance you can determine exactly who is writing the null characters (and possibly change its "write" method)? Q: Alternatively, any chance you can pre-process all files with a simple script (e.g. a 3-line C program) that reads each character and only writes out the non-"null" characters? – paulsm4 Nov 11 '21 at 23:53
  • Unfortunately I have no control over what comes to me, and I don't know C. What I DID end up doing was just using Pandas, as it handles encoding and CSVs in general far more elegantly. – Drew R Nov 26 '21 at 19:05
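
A minimal sketch of that pandas approach, for anyone who finds this later (reading every column as text with dtype=str, and writing back out in latin-1 to match the script above, are assumptions on my part rather than details confirmed in the comment):

import pandas as pd

# Read with a detected or known encoding; dtype=str keeps all columns as text.
df = pd.read_csv(file_path+file, encoding=encoding, dtype=str)
# Strip in-field commas from every column, mirroring the original script's intent.
df = df.apply(lambda col: col.str.replace(',', '', regex=False))
df.to_csv(file_path+file, index=False, encoding='latin-1')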

0 Answers