unrecognized character in header of csv

Question

import csv

def readCSV(filename, begin_date="01/07/2020", end_date="30/09/2020"):
    file = open(filename)
    csvreader = csv.reader(file)
    header = []
    header = next(csvreader)

if __name__ == '__main__':
    raw_load_data = readCSV("Total_load_2020.csv")
    raw_forecast_data = readCSV("Total_load_forecast_2020.csv")

The data is a CSV file (downloaded online) and looks like this:

RowDate,RowTime,TotalLoadForecast
01/01/2020,00:00,8600.52
01/01/2020,00:15,8502.06
01/01/2020,00:30,8396.45
...

But the output contains some weird characters (which are not in the data):

['ï»¿RowDate', 'RowTime', 'TotalLoad']
['ï»¿RowDate', 'RowTime', 'TotalLoadForecast']

Of course, I can easily remove it. But why does that happen in the first place?

It looks like your file has a Unicode byte-order marker at the beginning. I suggest you do `file.seek(3)` immediately after opening it. You could check the first byte to see if it is 0xEF, 0xFE, or 0xFF. — Tim Roberts, Nov 04 '21 at 18:50
FWIW, `pandas.read_csv` can download CSV files into a dataframe for you (and correctly parses header rows) — OneCricketeer, Nov 04 '21 at 18:53
Same problem as: https://stackoverflow.com/questions/22974765/weird-characters-added-to-first-column-name-after-reading-a-toad-exported-csv-fi — Marc-Alexandru Baetica, Nov 04 '21 at 18:53
@OneCricketeer Yes, I gave it a try in the past before. But can it do something post-processing. It's returning a dictionary. What if I need to process specific line of rows — Xu Siyuan, Nov 04 '21 at 19:17
Pandas returns a dataframe, not a dictionary, but yes, it can do as much processing as you want (such as filtering datetime objects) — OneCricketeer, Nov 04 '21 at 19:24
That's a BOM, represented in Windows's CP1252 encoding; see a solution below. — Zach Young, Nov 05 '21 at 00:30
@Marc-AlexandruBaetica, it seems like a different problem, to me. BOM for sure, but different encoding, and different language (Python, not R). — Zach Young, Nov 05 '21 at 00:36

Zach Young · Accepted Answer · 2021-11-08T08:29:14.757

Yes, that's a BOM, U+FEFF BYTE ORDER MARK. OP's file is probably encoded UTF-8, but OP appears to be decoding it as CP-1252.

I say that because the three-byte sequence for a UTF-8-encoded BOM is \xEF\xBB\xBF and appears as ï»¿ when (wrongly?) decoded as CP-1252:

Encoding	Representation (hexadecimal)	Representation (decimal)	Bytes as CP1252 characters
UTF-8	`EF BB BF`	`239 187 191`	`ï»¿`
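
As a quick check of the table above, here is a minimal sketch that decodes the same three bytes both ways. It uses only the standard library; the variable names are mine, not OP's.

```python
import codecs

bom_bytes = codecs.BOM_UTF8  # b'\xef\xbb\xbf'

# Decoded as UTF-8, the three bytes form the single code point U+FEFF.
as_utf8 = bom_bytes.decode("utf-8")

# Decoded as CP-1252, each byte maps to its own character: ï » ¿
as_cp1252 = bom_bytes.decode("cp1252")

print(len(as_utf8), hex(ord(as_utf8)))
print(len(as_cp1252), [hex(ord(c)) for c in as_cp1252])
```

The first decode yields one character and the second yields three, which matches the three stray characters in front of RowDate in OP's header.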

Here's how to mock up OP's data with a leading BOM, from a BSD shell:

% echo -e '\xEF\xBB\xBFRowDate,RowTime,TotalLoadForecast
01/01/2020,00:00,8600.52
01/01/2020,00:15,8502.06
01/01/2020,00:30,8396.45' > sample.csv

and confirm it's there with less sample.csv:

<U+FEFF>RowDate,RowTime,TotalLoadForecast
01/01/2020,00:00,8600.52
01/01/2020,00:15,8502.06
01/01/2020,00:30,8396.45
sample.csv (END)

Less is correctly interpreting the three UTF-8 bytes as the Unicode code-point U+FEFF.
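
If no suitable shell is at hand, the same mock-up can be sketched in Python. This is only an illustration: it writes to a temporary directory rather than the current one, and the names `sample_text`, `path` and `has_bom` are assumptions of mine.

```python
import codecs
import os
import tempfile

sample_text = (
    "RowDate,RowTime,TotalLoadForecast\n"
    "01/01/2020,00:00,8600.52\n"
    "01/01/2020,00:15,8502.06\n"
    "01/01/2020,00:30,8396.45\n"
)

# Write the UTF-8 BOM followed by the UTF-8-encoded sample data.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + sample_text.encode("utf-8"))

# Confirm the BOM is there by inspecting the raw leading bytes.
with open(path, "rb") as f:
    has_bom = f.read(3) == codecs.BOM_UTF8

print(has_bom)  # True
```

Opening the file in binary mode avoids any decoding, so the check does not depend on the platform's default encoding.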

If OP still needs to read this file as CP-1252, they can try with the following... but I think they'll get errors because it doesn't actually seem like it is CP-1252:

import csv

with open('sample.csv', 'r', newline='', encoding='cp1252') as f:
    # Read the first 3 characters (in CP-1252 each byte decodes to one character)
    leading_chars = f.read(3)

    if leading_chars != 'ï»¿':
        f.seek(0)  # Not a BOM, reset stream to beginning of file
    # otherwise the BOM has been consumed, so it is skipped

    reader = csv.reader(f)
    for row in reader:
        print(row)

But, I really think this file should be decoded as UTF-8:

# Pass the encoding explicitly: the default depends on the platform's locale
with open('sample.csv', 'r', newline='', encoding='utf-8') as f:
    # Read the first (decoded) Unicode code point
    first_unicode_char = f.read(1)

    if first_unicode_char != '\ufeff':
        f.seek(0)  # Not a BOM, reset stream to beginning of file

    reader = csv.reader(f)
    for row in reader:
        print(row)

or, let Python handle the guesswork and eliminate a BOM if it exists, with the utf_8_sig codec:

with open('sample.csv', 'r', newline='', encoding='utf_8_sig') as f:
    rows = list(csv.reader(f))  # the first header field no longer carries the BOM

score 1 · Answer 2 · answered Apr 25 '22 at 08:39

Just change the line `file = open(filename)` to `file = open(filename, encoding='utf_8_sig')`:

def readCSV(filename, begin_date="01/07/2020", end_date="30/09/2020"):
    file = open(filename, encoding='utf_8_sig')
    csvreader = csv.reader(file)
    header = []
    header = next(csvreader)
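
For completeness, here is a self-contained sketch of the same idea that closes the file and returns what it read. The function name `read_csv_rows` and the temporary sample file are hypothetical, made up for this demonstration; the date arguments from OP's signature are left out because the question does not show how they are used.

```python
import codecs
import csv
import os
import tempfile


def read_csv_rows(filename):
    """Return (header, rows); 'utf-8-sig' drops a leading BOM if one is present."""
    with open(filename, newline="", encoding="utf-8-sig") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    return header, rows


# Demo file (assumption): a BOM-prefixed CSV shaped like OP's data.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "wb") as f:
    f.write(
        codecs.BOM_UTF8
        + b"RowDate,RowTime,TotalLoadForecast\r\n"
        + b"01/01/2020,00:00,8600.52\r\n"
    )

header, rows = read_csv_rows(path)
print(header)  # ['RowDate', 'RowTime', 'TotalLoadForecast']
print(rows)    # [['01/01/2020', '00:00', '8600.52']]
```

The same call also works on files without a BOM, since the utf-8-sig codec only strips the marker when it is actually present.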
