0

I am trying to read an xlsx file into a pandas df without downloading it but I am having issues.

I have tried read_excel while parsing the URL with sheet_id but encountered the following error that blocks me.

url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/export"
file_df = pd.read_excel(url, engine='openpyxl')

error:

zipfile.BadZipFile: File is not a zip file

I have a service account as authentication(should be enough) and can not wish to use a bearer token as a solution like this suggests.

Any help would be much appreciated.

Solrac
  • 37
  • 1
  • 6
  • What you're trying to load is a google spreadsheet, not an xlsx (Excel) file. They can be read as csv. see https://stackoverflow.com/questions/19611729/getting-google-spreadsheet-csv-into-a-pandas-dataframe – Tranbi Jun 15 '23 at 10:55
  • No I'm trying to load an excel not a gsheet. I know my above approach is incorrect – Solrac Jun 15 '23 at 12:04
  • OK. It seems you have to export it anyway. Try the code in my posted answer. – Tranbi Jun 15 '23 at 12:27

1 Answers1

1

Since your spreadsheet is hosted on google, you should export it to the desired format. Either csv:

df = pd.read_csv(f"https://docs.google.com/spreadsheets/export?id={sheet_id}&format=csv")

Or xlsx:

df = pd.read_excel(f"https://docs.google.com/spreadsheets/export?id={sheet_id}&format=xlsx")
Tranbi
  • 11,407
  • 6
  • 16
  • 33
  • The second options results in the same error as the one I mentioned above. The first option returns a df object but it seems to be all messed up. The headers and records are what appears to be HTML. – Solrac Jun 15 '23 at 13:05
  • To add to this I am sharing the file with the service account user, so auth shouldn't be the issue – Solrac Jun 15 '23 at 13:11
  • 1
    did you specify `format=xlsx` at the end of the url for the second option? Regarding the messed up headers, I can't reproduce it. Maybe you can share a sample? – Tranbi Jun 15 '23 at 13:14