I am reading in a web-exported file that has some BOM characters (ï»¿) at the start. I was able to handle them with pandas 0.25.0 using:

df = pd.read_csv("filepath", encoding='windows-1252', usecols=["user_id", "date", "page"])

However, after upgrading to pandas 1.1.0 the same code raises an error saying that the expected column "user_id" was not found, because the newer pandas version sees the column as ï»¿"user_id".
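
For anyone wondering where the extra header characters come from, here is a minimal sketch using only the standard library. It assumes the export starts with a UTF-8 BOM (the bytes EF BB BF), which is a guess based on the symptoms, and the header row is made up to match the column names above.

```python
import codecs

# The UTF-8 byte order mark is the three bytes EF BB BF.
bom = codecs.BOM_UTF8

# Made-up header row standing in for the first line of the exported file.
raw_header = bom + b'"user_id","date","page"\n'

# windows-1252 maps each BOM byte to a visible character instead of dropping it.
as_cp1252 = raw_header.decode("windows-1252")

# utf-8-sig recognises the BOM and strips it before returning the text.
as_utf8_sig = raw_header.decode("utf-8-sig")

first_cp1252 = as_cp1252.split(",")[0]
first_utf8_sig = as_utf8_sig.split(",")[0]

print(repr(first_cp1252))    # first field still carries the three decoded BOM characters
print(repr(first_utf8_sig))  # first field is just the quoted column name
```

Because the quote is no longer the first character of the field once the BOM is decoded this way, a CSV parser may keep the quotes as literal characters in the column name.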

Any ideas on workarounds? I would like to use the newer version of pandas.

This is different from the suggested questions so far: switching to the utf-8-sig encoding is not enough, because that throws errors, whereas the same code works fine with pandas 0.25, so something has changed in the way pandas works from version 1.0 onwards. I just want to avoid reading the BOM characters in as part of the header. My answer below shows how I got around it, since no one seems to know how to handle this in newer versions of pandas. I'm not sure why it was downvoted when it answers my question and would help others with the same problem.

AJR
  • I was thinking `BOM` is used in `UTF-8`, `UTF-16` and similar but not in `windows-1252` - Wikipedia: [BOM](https://en.wikipedia.org/wiki/Byte_order_mark) – furas Jan 07 '21 at 14:39
  • `windows-1252` is not UTF8. `windows-1252` is the Latin1 codepage people typically (but mistakenly) call `ASCII`. A BOM is used only in Unicode files – Panagiotis Kanavos Jan 07 '21 at 15:08
  • Does this answer your question? [Pandas df.to\_csv("file.csv" encode="utf-8") still gives trash characters for minus sign](https://stackoverflow.com/questions/25788037/pandas-df-to-csvfile-csv-encode-utf-8-still-gives-trash-characters-for-min) – Panagiotis Kanavos Jan 07 '21 at 15:12
  • It is different from the problem in that link. I tried both UTF encoding options but they won't read in the file; both give me an error that the codec can't decode a byte (invalid start byte), which is why I had to use windows-1252. That and latin1 both read in the file, but then include the BOM characters as part of the first header – AJR Jan 07 '21 at 16:25

1 Answer

I managed to find a workaround, though not the most elegant one, so I'm happy to hear if anyone else knows how to fix the issue while loading the file:

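import pandas as pd

df = pd.read_csv("filepath", encoding='windows-1252')
# The first header is read as 'ï»¿"user_id"' (BOM decoded as windows-1252), so rename it
df = df.rename(columns=dict((col, col.replace('ï»¿"user_id"', "user_id")) for col in df.columns))
df = df[["user_id", "date", "page"]]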
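
Another option that might avoid the rename is to strip the BOM bytes before the parser ever sees them. This is only a sketch: `read_text_without_bom` is a hypothetical helper name, and it assumes the file begins with a UTF-8 BOM while the rest of the content is windows-1252.

```python
import codecs
import io


def read_text_without_bom(path, encoding="windows-1252"):
    """Return the file's text as a StringIO, with a leading UTF-8 BOM removed.

    Hypothetical helper: assumes the BOM, if any, is the UTF-8 one (EF BB BF).
    """
    with open(path, "rb") as fh:
        raw = fh.read()
    # Drop the three BOM bytes only when they are actually present.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    # Decode the remainder with the same codec used in the question.
    return io.StringIO(raw.decode(encoding))
```

Assuming pandas is imported as `pd`, the buffer can then be passed to `pd.read_csv(read_text_without_bom("filepath"), usecols=["user_id", "date", "page"])`. The text is already decoded at that point, so no `encoding` argument should be needed. The whole file is read into memory first, which may not suit very large exports.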
AJR
  • Why don't you specify the correct encoding, `utf-8` or `utf-16`? Although, if `read_csv` worked in the past, you have a UTF8 file. Try using `utf-8-sig` to take care of the BOM. – Panagiotis Kanavos Jan 07 '21 at 15:09
  • @PanagiotisKanavos I tried both of those encoding options, but they both give me an error that the codec can't decode a byte (invalid start byte), which is why I had to use windows-1252 – AJR Jan 07 '21 at 16:21
  • I'm trying the same thing: I can read all of my CSV files (300) by setting encoding='ISO-8859-1', but if I use utf-8 or utf-16 it breaks with an error like "UnicodeError: UTF-16 stream does not start with BOM" or "can't decode byte" – qleoz12 Nov 09 '22 at 21:11