1

I have 3000 Excel files. I want to get headers of each file and store it as a csv. However, I am running into a parsing error:

 'utf-8' codec can't decode byte 0xfa in position 1: invalid start byte

I have already seen this post. It does not solve the problem: UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>

import glob
import pandas as pd

all_files = glob.glob("Converted Excels/*.xlsx")
file = all_files[0]

#Try 1
columns = []
with open(file, "r") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        columns.append([row])
        break

#Try 2
df = pd.read_csv(file, header=0, nrows=1)
df

Here is an example file. https://docs.google.com/spreadsheets/d/194QD14g_L0NQK6j3yO2Et2ZzycfQDzJXu7vdlr20owA/edit?usp=sharing

I converted this to Excel from a PDF. But during conversion, I had specified encoding="utf8".

How can I get the header from this file?

Thanks a lot for your help.

Student
  • 13
  • 4
  • `df = pd.read_csv('2011 abril 1.csv - 2011 abril 1.csv')` just worked for me.. – Erfan Apr 04 '19 at 21:16
  • `b'\x9d'.decode('cp437')` gives `¥` - so maybe your file is in `CP437`, not in `UTF8` and you have to use correct `endcoding` when you read it. – furas Apr 04 '19 at 21:25
  • @furas, sorry. I updated the proper error code. – Student Apr 04 '19 at 21:40
  • @Erfan, can you please explain that notation? – Student Apr 04 '19 at 21:40
  • I also download file from link and `read_csv('2011 abril 1.csv - 2011 abril 1.csv')` works for me too. Linux Mint, Python 3.7, pandas 0.24.1 – furas Apr 04 '19 at 21:42
  • `0xFA` is `ú` in encoding `Windows-1252` - so maybe you should use `encoding='windows1252'` when you load files. See: https://www.i18nqa.com/debug/utf8-debug.html – furas Apr 04 '19 at 21:45
  • BTW: `.xlsx` is not csv file. You can't use `read_csv()` or module `csv` to read `.xlsx`. Use `read_excel()`. As I know `.xlsx` is zip file with file xml inside. – furas Apr 04 '19 at 21:50
  • @furas Oh! "df = pd.read_excel(file)" solved the problem. Can you please post that as an answer? – Student Apr 04 '19 at 21:57

1 Answers1

0

.xlsx is not CSV file. You can't use pandas.read_csv() or module csv to read .xlsx.

Use pandas.read_excel() or modules for excel files. See: www.python-excel.org

As I know .xlsx is ZIP file with XML file inside - so you can also try to unzip it and read xml.

furas
  • 134,197
  • 12
  • 106
  • 148