parse .xls file from dropbox link

Question

I am trying to parse a table from a dropbox link (https://www.dropbox.com/s/i77mern7joxc9ur/TestResultCodelistVoC.xlsx). It is a .xlsx table and I have tried two methods so far

METHOD 1

codeID_url = 'https://www.dropbox.com/s/i77mern7joxc9ur/TestResultCodelistVoC.xlsx'

tables = pd.read_html(codeID_url)
df_codeID = tables[0]

gives

ValueError: No tables found

Which makes sense, as, in the end, I am not parsing a table from a html page. The commands above work perfectly fine for the tables in this page (https://www.ecdc.europa.eu/en/covid-19/variants-concern)

METHOD 2

codeID_url = 'https://www.dropbox.com/s/i77mern7joxc9ur/TestResultCodelistVoC.xlsx'
data = pd.read_excel(codeID_url,'TestResultCodelistVoC')

gives:

XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'<!DOCTYP'

I did find a topic on this same error here, though all the answers are dealing with a local .xls file, and in my case I am trying to parse a webpage/ link, which is in the end a .xls file.

I also came across one solution using a dropbox token, though I first would like to try to download the aforementioned table without a dropbox account, if possible.

score 0 · Answer 1 · 2021-06-30T14:10:24.317

Add ?dl=1 to the end of your URL.

>>> import pandas as pd
>>>
>>> url = 'https://www.dropbox.com/s/i77mern7joxc9ur/TestResultCodelistVoC.xlsx?dl=1'
>>> df = pd.read_excel(url)
>>> print(df)
             Codelistname  Codesystem name  ...                                     Short label DE 1st Release
0   TestResultCodelistVoC              NaN  ...                                  Confirmed 501Y.V1         NaN
1   TestResultCodelistVoC              NaN  ...                                  Confirmed 501Y.V2         NaN
2   TestResultCodelistVoC              NaN  ...                                  Confirmed 501Y.V3         NaN
3   TestResultCodelistVoC              NaN  ...                               Confirmed 501Y.V3.P1         NaN
4   TestResultCodelistVoC              NaN  ...                               Confirmed 501Y.V3.P2         NaN
5   TestResultCodelistVoC              NaN  ...                Confirmed not one of the listed VOC         NaN
6   TestResultCodelistVoC              NaN  ...                            Compatible with 501Y.V1         NaN
7   TestResultCodelistVoC              NaN  ...                            Compatible with 501Y.V2         NaN
8   TestResultCodelistVoC              NaN  ...                            Compatible with 501Y.V3         NaN
9   TestResultCodelistVoC              NaN  ...                         Compatible with 501Y.V3.P1         NaN
10  TestResultCodelistVoC              NaN  ...                         Compatible with 501Y.V3.P2         NaN
11  TestResultCodelistVoC              NaN  ...                          Compatible with 501Y.V2-3         NaN
12  TestResultCodelistVoC              NaN  ...                              Compatible with a VOC         NaN
13  TestResultCodelistVoC              NaN  ...                             Confirmed MinkCluster5         NaN
14  TestResultCodelistVoC              NaN  ...                       Compatible with MinkCluster5         NaN
15  TestResultCodelistVoC              NaN  ...                        Not compatible with 501Y.V1         NaN
16  TestResultCodelistVoC              NaN  ...                      Not compatible with 501Y.V2-3         NaN
17  TestResultCodelistVoC              NaN  ...  No compatibility with VOC detected (VOC not fu...         NaN
18  TestResultCodelistVoC              NaN  ...                           Other variant of concern         NaN

[19 rows x 12 columns]
>>>

Could you parse the url as a df with this very command? I am getting `XLRDError: Excel xlsx file; not supported`. My version of xlrd is `2.0.1` — BCArg, Jun 30 '21 at 13:45
You'll need to install `openpyxl`. As the error says, `xlrd` does not support XLSX files. — , Jun 30 '21 at 14:08
It even says so at the very top of the [documentation for xlrd](https://pypi.org/project/xlrd/) - "`xlrd` is a library for reading data and formatting information from Excel files in the historical `.xls` format." — , Jun 30 '21 at 14:16
that's odd then, I do have openpyxl installed. `conda list | grep 'openpy` returns `openpyxl 3.0.5 py_0 anaconda`. I am now using the very same command as you still getting `XLRDError: Excel xlsx file; not supported`. Perhaps this is related to the pandas version? I have `0.25.3` — BCArg, Jul 01 '21 at 08:05
What do you get when you try `import openpyxl` in a python script? — , Jul 05 '21 at 15:39
I have `pandas==1.2.2`, `openpyxl==3.0.7`, and `requests==2.25.1`. — , Jul 05 '21 at 16:31
I don't get any errors. My versions are `pandas==0.25.3`, `openpyxl==3.0.5` and same version as you for requests. But anyhow it turns out that now it is no longer necessary to parse that table, it might be related to the package version — BCArg, Jul 06 '21 at 09:41

parse .xls file from dropbox link

1 Answers1