How to get data from xls files using pandas

Question

I am trying to merge some xls files, but when I run the script I don't get the expected result which would be a new file containing all the data from the originals, keeping structure etc. What I get is a lot of coded info and files paths

import pandas as pd
from glob import glob

planilhas_pasta = sorted(glob(r'C:\Users\de0186619\Documents\saldo_usuario_*.xls'))

planilhas_concatenadas = pd.concat((pd.read_csv(file, sep=';', header=3, encoding="ISO-8859-1", on_bad_lines='skip').assign(filename=file)
                                    for file in planilhas_pasta), ignore_index=True)
print(planilhas_concatenadas)

The result I get is:

Lista de Usu&aacute ... filename 0 ... C:\Users\de0186619\Documents\saldo_usuario_092... 3 thead: first-child { ... C:\Users\de0186619\Documents\saldo_usuario_092... 4 counter-reset: page ... C:\Users\de0186619\Documents\saldo_usuario_092...

Can I ask why you’re using `read_csv` to read an Excel file? — S3DEV, Jul 15 '22 at 13:32
@S3DEV when I use read_excel I get the error: 'zipfile.BadZipFile: File is not a zip file' — Vinícius Donato, Jul 15 '22 at 13:39
If it is a true Excel file, the content is binary, which needs to be parsed using an xls-type reader. Perhaps try passing the parsing engine, as discussed [here](https://stackoverflow.com/a/50815107/6340496). — S3DEV, Jul 15 '22 at 13:44
@S3DEV yeah, I've read that could be an engine problem, but even when I use engine='openpyxl', the same error shows up — Vinícius Donato, Jul 15 '22 at 14:03
It appears that the readers are expecting zipped data files, which is what `xlsx` files are; zipped. Appears you have an *old*-style Excel file, as suggested by 1) the extension (`xls`) and the fact `read_excel` / `openpyxl` are expecting a zip file. Try reading with [`xlrd`](https://pypi.org/project/xlrd/). If this opens the file successfully, I'll work up an example to parse the file yourself and load it into a DataFrame. — S3DEV, Jul 15 '22 at 14:11
@S3DEV I tried this solution proposed by you, didn't work either. Now I get the error: -- xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'<!DOCTYP' -- I'm guessing that the problem is the XLS files that I am trying to read are created by a reporting software and can’t be read either by pandas or xlrd, but they open in Microsoft Excel with no problems — Vinícius Donato, Jul 15 '22 at 16:47
Hmmm. 1) Can they be read as plain text, as with Notepad or the like? 2) Can they be opened in Excel and re-saved as `.xlsx` files? — S3DEV, Jul 15 '22 at 17:57
@S3DEV yeah, but when I open them with notepad it shows the data as if it was an HTML. I tried re-saving the files changing the format to .xlsx and it worked just fine! Now I just have to find a way to make this change to all the files in a folder, because I have like tens of them. Thanks for your help and time!!! — Vinícius Donato, Jul 15 '22 at 18:21
My pleasure, interesting challenge. Hope you find a way that works for what you need. All the best. — S3DEV, Jul 15 '22 at 18:38
Please provide enough code so others can better understand or reproduce the problem. — Community, Jul 22 '22 at 05:56

How to get data from xls files using pandas

0 Answers0