-1

My actual question was this

Read exel file from url in python 3.6

but having tried the instructions in comments I got the error
io.UnsupportedOperation: seek

Here's the code:

import pandas as pd
from urllib.request import Request, urlopen
url = "https://<myOrg>.sharepoint.com/:x:/s/x-taulukot/Ec0R1y3l7sdGsP92csSO-mgBI8WCN153LfEMvzKMSg1Zzg?e=6NS5Qh"
req = Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0')

content = urlopen(req)
pd.read_excel(content)
print(df)

and the result:

(venv) miettinj@ramen:~/beta/python> python test.py
Traceback (most recent call last):
  File "test.py", line 9, in <module>
    pd.read_excel(content)
  File "/srv/work/miettinj/beta/python/venv/lib/python3.6/site-packages/pandas/util/_decorators.py", line 296, in wrapper
    return func(*args, **kwargs)
  File "/srv/work/miettinj/beta/python/venv/lib/python3.6/site-packages/pandas/io/excel/_base.py", line 304, in read_excel
    io = ExcelFile(io, engine=engine)
  File "/srv/work/miettinj/beta/python/venv/lib/python3.6/site-packages/pandas/io/excel/_base.py", line 851, in __init__
    if _is_ods_stream(path_or_buffer):
  File "/srv/work/miettinj/beta/python/venv/lib/python3.6/site-packages/pandas/io/excel/_base.py", line 800, in _is_ods_stream
    stream.seek(0)
io.UnsupportedOperation: seek
Jaana
  • 266
  • 1
  • 3
  • 14

2 Answers2

-1

It can work by passing a string of the page (excel) content:

import pandas as pd
from urllib.request import Request, urlopen
url = "https://<myOrg>.sharepoint.com/:x:/s/x-taulukot/Ec0R1y3l7sdGsP92csSO-mgBI8WCN153LfEMvzKMSg1Zzg?e=6NS5Qh"
req = Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0')

content = urlopen(req)
pd.read_excel(content.read().decode())
print(df)
Learning is a mess
  • 7,479
  • 7
  • 35
  • 71
  • With this I got the following: File "/srv/work/miettinj/venv-37/lib/python3.7/site-packages/fsspec/core.py", line 315, in _un_chain cls = get_filesystem_class(protocol) File "/srv/work/miettinj/venv-37/lib/python3.7/site-packages/fsspec/registry.py", line 208, in get_filesystem_class raise ValueError("Protocol not known: %s" % protocol) ValueError: Protocol not known: – Jaana Jun 22 '23 at 05:16
  • What about `pd.read_excel(StringIO(content.read().decode()))`, with `StringIO` imported from `io` – Learning is a mess Jun 22 '23 at 08:36
  • with pd.read_excel(StringIO(content.read().decode())) I get this: Traceback (most recent call last): File "stringIO.py", line 10, in pd.read_excel(StringIO(content.read().decode())) File "/srv/work/miettinj/venv-37/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 364, in read_excel io = ExcelFile(io, storage_options=storage_options, engine=engine) File "/srv/work/miettinj/venv-37/lib/python3.7/site-packages/pandas/io/excel/_base.py", line 1079, in inspect_excel_format assert isinstance(buf, bytes) AssertionError – Jaana Jun 22 '23 at 09:00
  • Looks like it wants a byte input: I would try `pd.read_excel(BytesIO(content.read()))` with `BytesIO` from io – Learning is a mess Jun 22 '23 at 09:02
  • with BytesIO I get this: pd.read_excel(BytesIO(content.read().decode())) TypeError: a bytes-like object is required, not 'str' – Jaana Jun 22 '23 at 09:08
  • Note that I removed the `.decode()` for this reason – Learning is a mess Jun 22 '23 at 09:09
  • Thanks, with pd.read_excel(BytesIO(content.read()) I got: "Excel file format cannot be determined, you must specify " ValueError: Excel file format cannot be determined, you must specify an engine manually. With pd.read_excel(BytesIO(content.read())), engine='openpyxl') I got: raise BadZipFile("File is not a zip file") zipfile.BadZipFile: File is not a zip file – Jaana Jun 22 '23 at 09:14
  • Okay, I think I cannot help you further without having access to your remote resource - which I understand you prefer not to provide. In case you are the one who downvoted both candidate answers, I don't find it fair to do so in a setting where we cannot reproduce your issue locally and test our solutions. – Learning is a mess Jun 22 '23 at 09:16
  • May be of interest https://stackoverflow.com/questions/33873423/xlsx-and-xlsm-files-return-badzipfile-file-is-not-a-zip-file – Learning is a mess Jun 22 '23 at 09:17
-1

You have to append &download=1 to your url to download the file and not to open it in the browser. However, you should use requests package to get your file:

import pandas as pd
import requests

url = 'https://<myOrg>.sharepoint.com/:x:/s/x-taulukot/Ec0R1y3l7sdGsP92csSO-mgBI8WCN153LfEMvzKMSg1Zzg?e=6NS5Qh'
url += '&download=1'  # now you have a link to the file and not the editor
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}

resp = requests.get(url, headers=headers)
df = pd.read_excel(resp.content, engine='openpyxl')

Update: I tried with Python==3.6.15, Pandas==1.1.15 and openpyxl==3.1.2 and it works perfectly fine.

Corralien
  • 109,409
  • 8
  • 28
  • 52
  • With this df = pd.read_excel(url, storage_options=headers, engine='openpyxl') I got error: zipfile.BadZipFile: File is not a zip file. If I was reading a xls-file present in my server I should just close the related hidden files, but when reading the url I don't know how to close them... I'm refering to https://stackoverflow.com/questions/68478097/pandas-glob-excel-file-format-cannot-be-determined-you-must-specify-an-engi – Jaana Jun 22 '23 at 05:35
  • Pandas uses `urllib.request.Request` and `urlopen` itself that's why you needn't to handle it manually. However, I think your url is not right, I mean it's not the direct link to get your file. – Corralien Jun 22 '23 at 06:16
  • But I get the excel when I paste the same url into browser, thus I think it should be correct – Jaana Jun 22 '23 at 06:42
  • Yes, this code "import requests url="https://.sharepoint.com/:x:/s/x-taulukot/Ec0R1y3l7sdGsP92csSO-mgBI8WCN153LfEMvzKMSg1Zzg?e=6NS5Qh" session_obj = requests.Session() response = session_obj.get(url, headers={"User-Agent": "Mozilla/5.0"}) print(response.status_code)" returns HTTP200. I just don't know how to read the results – Jaana Jun 22 '23 at 07:16
  • `df = pd.read_except(response.content)`. Did you try my solution? (append `&download=1` to your url. – Corralien Jun 22 '23 at 07:21
  • yes, I tried and got this: xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n – Jaana Jun 22 '23 at 07:35
  • You shoud install `openpyxl`: `pip install openpyxl` and use `pd.read_excel(response.content, engine='openpyx'l)`. `xlrd` can't read `xlsX` file. – Corralien Jun 22 '23 at 07:37
  • with engine='openpyxl' I get this: raise BadZipFile("File is not a zip file") zipfile.BadZipFile: File is not a zip file – Jaana Jun 22 '23 at 07:41
  • Did you append `donwload=1` to your URL? – Corralien Jun 22 '23 at 08:23
  • yes. I copied your answer and replaced just in the url – Jaana Jun 22 '23 at 09:02
  • I will try to test with python 3.6. It works for me with python3.9 and pandas2.0 – Corralien Jun 22 '23 at 09:09
  • @Jaana. Did you solve your problem? I tested my code with python 3.6. I had to install openpyxl and force the engine and it works. I updated my answer. Maybe you don't share your file correctly on Office365? – Corralien Jun 28 '23 at 06:40
  • Sorry, I don't understand what you mean by "sharing file correctly on Office365" But I think the problem is that sharepoint requires some user authentication which is not provided by my code samples -- and I don't know how to provide that information. – Jaana Jun 28 '23 at 11:09
  • You can share some resources with *"Anyone with the link"*. Useful to use with Pandas for example: https://support.microsoft.com/en-us/office/share-sharepoint-files-or-folders-1fe37332-0f9a-4719-970e-d2578da4941c. (I will remove my answer and this thread to avoid DV) – Corralien Jun 28 '23 at 11:23