
Edit: the question Convert UTF-8 with BOM to UTF-8 with no BOM in Python only works on txt files; it does not solve my issue with csv files.

I have two csv files:

rtc_csv_file="csv_migration\\rtc-test.csv"
ads_csv_file="csv_migration\\ads-test.csv"

Here is the ads-test.csv file (the one causing issues): https://easyupload.io/bk1krp. When I open the csv in VS Code, the bottom-right corner says the file is UTF-8 with BOM.

I am trying to write a python function that reads in every row and converts it to a dict object.

My function works fine for the first file, rtc-test.csv, but for the second file, ads-test.csv, I get the error UTF-16 stream does not start with BOM when I use utf-16. I've tried utf-8 and utf-8-sig instead, but then each line is read in as a single string with commas separating the values. I can't just split on commas, because some column values will themselves contain commas.

My python code correctly reads in rtc-test.csv as a list of values. How can I read in ads-test.csv as a list of values when the csv is encoded as UTF-8 with BOM?

code:

rtc_csv_file="csv_migration\\rtc-test.csv"
ads_csv_file="csv_migration\\ads-test.csv"

from csv import reader
import csv

# read in csv, convert to map organized by 'id' as index root parent value
def read_csv_as_map(csv_filename, id_format, encodingVar):
    print('filename: '+csv_filename+', id_format: '+id_format+', encoding: '+encodingVar)
    dict={}
    dict['rows']={}
    try:
        with open(csv_filename, 'r', encoding=encodingVar) as read_obj:
            csv_reader = reader(read_obj, delimiter='\t')
            csv_cols = None
            for row in csv_reader:
                if csv_cols is None:
                    csv_cols = row 
                    dict['csv_cols']=csv_cols
                    print('csv_cols=',csv_cols)
                else:
                    row_id_val = row[csv_cols.index(str(id_format))]
                    print('row_id_val=',row_id_val)
                    dict['rows'][row_id_val] = row
        print('done')
        return dict
    except Exception as e:
        print('err=',e)
        return {}

rtc_dict = read_csv_as_map(rtc_csv_file, 'Id', 'utf-16')
ads_dict = read_csv_as_map(ads_csv_file, 'ID', 'utf-16')

console output:

filename: csv_migration\rtc-test.csv, id_format: Id, encoding: utf-16
csv_cols= ['Summary', 'Status', 'Type', 'Id', '12NC']
row_id_val= 262998
done
filename: csv_migration\ads-test.csv, id_format: ID, encoding: utf-16
err= UTF-16 stream does not start with BOM

If I try utf-16-le instead, I get a different error: 'utf-16-le' codec can't decode byte 0x22 in position 0: truncated data

If I try utf-16-be, I get this error: 'utf-16-be' codec can't decode byte 0x22 in position 0: truncated data

Why can't my python code read this csv file?

POVR2
  • The file is not encoded as UTF-16, or is not encoded correctly somehow. Can you share the first few bytes (say 10 or 12) of the file? – snakecharmerb Aug 10 '22 at 14:51
  • We can’t tell you the correct encoding without seeing (a representative, ideally small sample of) the actual contents of the data in an unambiguous representation; a hex dump of the problematic byte(s) with a few bytes of context on each side is often enough, especially if you can tell us what you think those bytes are supposed to represent. See also https://meta.stackoverflow.com/questions/379403/problematic-questions-about-decoding-errors – tripleee Aug 10 '22 at 14:54
  • @snakecharmerb I'm trying to view the file with a hex editor to get you those bytes; in the meantime, here is the problematic file: https://easyupload.io/bk1krp – POVR2 Aug 10 '22 at 15:02
  • My editor identifies that file as utf-8 with BOM. – Steven Rumbalski Aug 10 '22 at 15:11
  • What text editor do you use? VS Code displays the file encoding on the bottom right. I'm sure other good editors do something similar. So open your file in a good text editor to make sure you have the right encoding. – Steven Rumbalski Aug 10 '22 at 15:16
  • If I parse the file using my function above, with `encoding=utf-8` and `delimiter='\t'`, when I get to `for row in csv_reader:` the first result is this single string `\ufeffTitle,State,Work Item Type,ID,12NC`, whereas calling the function for rtc_csv_file gives a list of values for each row. I'm hoping to parse the problematic ads file in a similar way, so that it gets a list of each column name without me having to parse/split on a character such as a comma – POVR2 Aug 10 '22 at 15:18
  • If your input file mixes comma-delimited rows and tab-delimited rows, it is not a well-formed CSV file. You can work around it by splitting again on comma on the rows which end up containing a single value. But this is really too vague still without a representative sample of the actual data. – tripleee Aug 14 '22 at 10:26
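That workaround can be sketched as follows (a hypothetical `normalize` helper, assuming rows mis-split by the tab delimiter come back as single-element lists):

```python
import csv
import io

# Rows read with delimiter='\t' from a comma-separated file come back
# as single-element lists; re-parse those through the csv module so
# quoted commas are still respected.
def normalize(rows):
    fixed = []
    for row in rows:
        if len(row) == 1:
            # Re-run the lone string through a comma-delimited reader.
            fixed.append(next(csv.reader(io.StringIO(row[0]))))
        else:
            fixed.append(row)
    return fixed

print(normalize([['a,b,"c,d"'], ['x', 'y']]))  # [['a', 'b', 'c,d'], ['x', 'y']]
```

Note the quoted field `"c,d"` stays whole, which naive `str.split(',')` would not give you.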

1 Answer


Your CSV is encoded with UTF-8 (the default) instead of UTF-16, so pass that as the encoding:

ads_csv_file="ads-test.csv"

from csv import reader

# read in csv, convert to map organized by 'id' as index root parent value
def read_csv_as_map(csv_filename, id_format, encodingVar):
    print('filename: '+csv_filename+', id_format: '+id_format+', encoding: '+encodingVar)
    dict={}
    dict['rows']={}
    try:
        with open(csv_filename, 'r', encoding=encodingVar) as read_obj:
            csv_reader = reader(read_obj, delimiter='\t')
            csv_cols = None
            for row in csv_reader:
                if csv_cols is None:
                    csv_cols = row
                    dict['csv_cols']=csv_cols
                    print('csv_cols=',csv_cols)
                else:
                    row_id_val = row[csv_cols.index(str(id_format))]
                    print('row_id_val=',row_id_val)
                    dict['rows'][row_id_val] = row
        print('done')
        return dict
    except Exception as e:
        print('err=',e)
        return {}

ads_dict = read_csv_as_map(ads_csv_file, 'ID', 'utf-8')  # <- updated here

Here's the CSV for reference:

Title,State,Work Item Type,ID,12NC
"453560751251 TOOL, SQ-59 CORNER CLAMP","To Do","FRUPS","6034","453560751251"
vahlala
  • If I parse the file using my function above, with `encoding=utf-8` and `delimiter='\t'`, when I get to `for row in csv_reader:` the first result is this single string `\ufeffTitle,State,Work Item Type,ID,12NC`, whereas calling the function for rtc_csv_file gives a list of values for each row. I'm hoping to parse the problematic ads file in a similar way, so that it gets a list of each column name without me having to parse/split on a character such as a comma – POVR2 Aug 10 '22 at 15:18
  • @POVR2 Your file is either malformed or in an encoding that's not UTF-8 or UTF-16. With each of these encodings I got the same result: `utf-8`, `utf-16-le`, `utf-16-be`. If you know that your file will only have commas as special characters, you can just read the file as a string and split on `\n` and `,` – vahlala Aug 10 '22 at 15:26
  • Splitting on `,` might work for the column headers, but I will eventually have column values which include commas in description/comment fields, so I can't reliably split on single characters. I've tried utf-8 and utf-8-sig, but they still just return a single string for each row instead of separating it into a list – POVR2 Aug 10 '22 at 15:29
  • You can use `encoding='utf-8-sig'` to discard the BOM. – tripleee Aug 14 '22 at 09:48
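As a minimal illustration of that suggestion, decoding the same bytes with utf-8 versus utf-8-sig:

```python
# utf-8 keeps the BOM as a leading '\ufeff' character;
# utf-8-sig silently discards it.
bom_bytes = b'\xef\xbb\xbfTitle,State'

print(repr(bom_bytes.decode('utf-8')))      # '\ufeffTitle,State' - BOM leaks into the data
print(repr(bom_bytes.decode('utf-8-sig')))  # 'Title,State' - BOM stripped
```

That stray `\ufeff` is exactly the prefix seen on `\ufeffTitle` in the comments above.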