import csv

def readCSV(filename, begin_date="01/07/2020", end_date="30/09/2020"):
    file = open(filename)
    csvreader = csv.reader(file)
    header = []
    header = next(csvreader)
    print(header)

if __name__ == '__main__':
    raw_load_data = readCSV("Total_load_2020.csv")
    raw_forecast_data = readCSV("Total_load_forecast_2020.csv")

The data is a CSV file (downloaded online) and looks like this:

RowDate,RowTime,TotalLoadForecast
01/01/2020,00:00,8600.52
01/01/2020,00:15,8502.06
01/01/2020,00:30,8396.45
...

But the output contains some weird characters (not present in the data):

['ï»¿RowDate', 'RowTime', 'TotalLoad']
['ï»¿RowDate', 'RowTime', 'TotalLoadForecast']

Of course, I can easily remove it. But why does that happen in the first place?

Xu Siyuan
  • It looks like your file has a Unicode byte-order marker at the beginning. I suggest you do `file.seek(3)` immediately after opening it. You could check the first byte to see if it is 0xEF, 0xFE, or 0xFF. – Tim Roberts Nov 04 '21 at 18:50
  • FWIW, `pandas.read_csv` can download CSV files into a dataframe for you (and correctly parses header rows) – OneCricketeer Nov 04 '21 at 18:53
  • Same problem as: https://stackoverflow.com/questions/22974765/weird-characters-added-to-first-column-name-after-reading-a-toad-exported-csv-fi – Marc-Alexandru Baetica Nov 04 '21 at 18:53
  • @OneCricketeer Yes, I tried it before. But can it do any post-processing? It returns a dictionary. What if I need to process specific rows? – Xu Siyuan Nov 04 '21 at 19:17
  • Pandas returns a dataframe, not a dictionary, but yes, it can do as much processing as you want (such as filtering datetime objects) – OneCricketeer Nov 04 '21 at 19:24
  • That's a BOM, represented in Windows's CP1252 encoding; see a solution below. – Zach Young Nov 05 '21 at 00:30
  • @Marc-AlexandruBaetica, it seems like a different problem, to me. BOM for sure, but different encoding, and different language (Python, not R). – Zach Young Nov 05 '21 at 00:36

2 Answers


Yes, that's a BOM, U+FEFF BYTE ORDER MARK. OP's file is probably encoded UTF-8, but OP appears to be decoding it as CP-1252.

I say that because the three-byte sequence for a UTF-8-encoded BOM is \xEF\xBB\xBF, and it appears as ï»¿ when (wrongly?) decoded as CP-1252:

| Encoding | Representation (hexadecimal) | Representation (decimal) | Bytes as CP1252 characters |
|---|---|---|---|
| UTF-8 | EF BB BF | 239 187 191 | ï»¿ |
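
The same comparison can be reproduced in Python without any file at all. This is a minimal sketch that decodes the BOM plus the header text from the question three ways; the variable names are only illustrative.

```python
import codecs

# The UTF-8 encoding of U+FEFF is the three bytes EF BB BF.
header_bytes = codecs.BOM_UTF8 + b"RowDate,RowTime,TotalLoadForecast"

# Decoded as CP-1252, each of the three bytes becomes its own visible character.
as_cp1252 = header_bytes.decode("cp1252")

# Decoded as UTF-8, the three bytes collapse into the single code point U+FEFF.
as_utf8 = header_bytes.decode("utf-8")

# The utf_8_sig codec additionally drops a leading BOM if one is present.
as_utf8_sig = header_bytes.decode("utf_8_sig")

# ascii() escapes non-ASCII characters so the result prints on any terminal.
print(ascii(as_cp1252[:3]))  # three separate characters
print(ascii(as_utf8[:1]))    # one character, the BOM itself
print(as_utf8_sig)           # header text with no prefix
```

Only the last decoding gives a first column name that compares equal to 'RowDate'.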

Here's how to mock up OP's data with a leading BOM, from a BSD shell:

% echo -e '\xEF\xBB\xBFRowDate,RowTime,TotalLoadForecast
01/01/2020,00:00,8600.52
01/01/2020,00:15,8502.06
01/01/2020,00:30,8396.45' > sample.csv
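
For anyone without a shell whose `echo` understands `-e`, roughly the same file can be produced from Python. This is a sketch that assumes writing `sample.csv` into the current directory is acceptable; it relies on the `utf_8_sig` codec emitting the BOM once at the start of the output.

```python
import codecs
from pathlib import Path

sample_text = (
    "RowDate,RowTime,TotalLoadForecast\n"
    "01/01/2020,00:00,8600.52\n"
    "01/01/2020,00:15,8502.06\n"
    "01/01/2020,00:30,8396.45\n"
)

sample_path = Path("sample.csv")

# Writing through utf_8_sig prepends the three BOM bytes to the encoded text.
sample_path.write_text(sample_text, encoding="utf_8_sig")

# Read the raw bytes back to confirm the file really starts with EF BB BF.
raw = sample_path.read_bytes()
starts_with_bom = raw.startswith(codecs.BOM_UTF8)
print(starts_with_bom)
```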

and confirm it's there with less sample.csv:

<U+FEFF>RowDate,RowTime,TotalLoadForecast
01/01/2020,00:00,8600.52
01/01/2020,00:15,8502.06
01/01/2020,00:30,8396.45
sample.csv (END)

Less is correctly interpreting the three UTF-8 bytes as the Unicode code-point U+FEFF.
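
The same check can be done from Python by looking at the raw bytes before choosing an encoding, which is close to what the comment about checking the first byte suggests. The helper name `sniff_bom` and the throwaway file names below are hypothetical; the BOM constants come from the standard `codecs` module.

```python
import codecs
import os
import tempfile

# Checked in this order because the UTF-32 LE mark (FF FE 00 00)
# begins with the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf_32"),
    (codecs.BOM_UTF32_BE, "utf_32"),
    (codecs.BOM_UTF8, "utf_8_sig"),
    (codecs.BOM_UTF16_LE, "utf_16"),
    (codecs.BOM_UTF16_BE, "utf_16"),
]


def sniff_bom(path):
    """Return a codec name that consumes the file's BOM, or None if no BOM is found."""
    with open(path, "rb") as f:
        head = f.read(4)  # the longest BOM is four bytes
    for bom, codec in _BOMS:
        if head.startswith(bom):
            return codec
    return None


# Demonstration on two throwaway files, one with a UTF-8 BOM and one without.
with tempfile.TemporaryDirectory() as tmp:
    with_bom = os.path.join(tmp, "with_bom.csv")
    without_bom = os.path.join(tmp, "without_bom.csv")
    with open(with_bom, "wb") as f:
        f.write(codecs.BOM_UTF8 + b"RowDate,RowTime,TotalLoadForecast\n")
    with open(without_bom, "wb") as f:
        f.write(b"RowDate,RowTime,TotalLoadForecast\n")
    result_with = sniff_bom(with_bom)
    result_without = sniff_bom(without_bom)

print(result_with, result_without)
```

A `None` result only means no BOM was found; it says nothing about which encoding the file actually uses.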

If OP still needs to read this file as CP-1252, they can try with the following... but I think they'll get errors because it doesn't actually seem like it is CP-1252:

import csv

with open('sample.csv', 'r', newline='', encoding='cp1252') as f:
    # Read the first 3 characters (one per byte in CP-1252)
    leading_chars = f.read(3)

    if leading_chars != 'ï»¿':
        f.seek(0)  # Not a BOM, reset stream to beginning of file
    # otherwise the BOM has been consumed and the reader starts after it

    reader = csv.reader(f)
    for row in reader:
        print(row)

But, I really think this file should be decoded as UTF-8:

with open('sample.csv', 'r', newline='', encoding='utf-8') as f:  # be explicit: the default encoding depends on the platform locale
    # Read the first (decoded) Unicode code point
    first_unicode_char = f.read(1)

    if first_unicode_char != '\ufeff':
        f.seek(0)  # Not a BOM, reset stream to beginning of file

    reader = csv.reader(f)
    for row in reader:
        print(row)

or, let Python handle the guesswork and eliminate a BOM if it exists, with the utf_8_sig decoder:

with open('sample.csv', 'r', newline='', encoding='utf_8_sig') as f:
    rows = list(csv.reader(f))  # rows[0][0] is 'RowDate', with no BOM in front
Zach Young

Just update line "file = open(filename)" to "file = open(filename, encoding='utf_8_sig')"

def readCSV(filename, begin_date="01/07/2020", end_date="30/09/2020"):
    file = open(filename, encoding='utf_8_sig')
    csvreader = csv.reader(file)
    header = []
    header = next(csvreader)
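
A slightly fuller sketch of the same fix, for illustration only. The function name `read_csv_rows`, the day/month/year date format, and the filtering on the first column are assumptions inferred from the default arguments and the sample data, not something the question confirms. The demo shows that `utf_8_sig` gives the same result whether or not the file starts with a BOM.

```python
import codecs
import csv
import os
import tempfile
from datetime import datetime

DATE_FORMAT = "%d/%m/%Y"  # assumption: day/month/year, as "30/09/2020" suggests


def read_csv_rows(filename, begin_date="01/07/2020", end_date="30/09/2020"):
    """Return (header, rows) with rows limited to begin_date..end_date inclusive."""
    begin = datetime.strptime(begin_date, DATE_FORMAT)
    end = datetime.strptime(end_date, DATE_FORMAT)
    # utf_8_sig drops a leading BOM if present and otherwise behaves like utf-8.
    with open(filename, newline="", encoding="utf_8_sig") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = [
            row
            for row in reader
            if row and begin <= datetime.strptime(row[0], DATE_FORMAT) <= end
        ]
    return header, rows


# Demo data copied from the question's sample.
data = (
    b"RowDate,RowTime,TotalLoadForecast\r\n"
    b"01/01/2020,00:00,8600.52\r\n"
    b"01/01/2020,00:15,8502.06\r\n"
    b"01/01/2020,00:30,8396.45\r\n"
)

with tempfile.TemporaryDirectory() as tmp:
    bom_path = os.path.join(tmp, "with_bom.csv")
    plain_path = os.path.join(tmp, "plain.csv")
    with open(bom_path, "wb") as f:
        f.write(codecs.BOM_UTF8 + data)
    with open(plain_path, "wb") as f:
        f.write(data)
    header_bom, rows_bom = read_csv_rows(bom_path, "01/01/2020", "01/01/2020")
    header_plain, rows_plain = read_csv_rows(plain_path, "01/01/2020", "01/01/2020")

print(header_bom)
print(header_bom == header_plain and rows_bom == rows_plain)
```

Using `with` also closes the file, which the original `open(filename)` call never does.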
DM Equinox