
I've tried urllib, requests, and wget. None of them works.

I'm trying to download a 300 KB .npz file from a URL. When I download the file with wget.download(), urllib.request.urlretrieve(), or requests, no error is thrown and a .npz file is saved. However, the saved file is only 1 KB rather than 300 KB, and it is unreadable: np.load() fails with OSError: Failed to interpret file 'x.npz' as a pickle.

I am also certain that the URL is valid. When I download the file with a browser, it is correctly read by np.load() and has the right file size.

Thank you very much for the help.


Edit 1:

The full code was requested. This was the code:

import numpy as np
import wget

loadfrom = "http://example.com/dist/x.npz"
savedir = "x.npz"
wget.download(loadfrom, savedir)
data = np.load(savedir)

I've also used variants with urllib:

import urllib.request
import numpy as np

loadfrom = "http://example.com/dist/x.npz"
savedir = "x.npz"
urllib.request.urlretrieve(loadfrom, savedir)
data = np.load(savedir)

and requests:

import numpy as np
import requests

loadfrom = "http://example.com/dist/x.npz"
savedir = "x.npz"
r = requests.get(loadfrom).content
with open(savedir, 'wb') as f:
    f.write(r)
data = np.load(savedir)

They all produce the same result, with the aforementioned conditions.

  • A `.npz` file is supposed to be a `zip` archive. However, `np.load` depends on finding a `ZIP_PREFIX` string at the start. Failing that, it looks for a `.npy` prefix, or a `pickle` prefix. If all those fail, then the file is corrupted in some way, and `np.load` cannot read it. – hpaulj Feb 17 '19 at 06:24
  • what's the content of the file you download? can you share the code? – abolotnov Feb 17 '19 at 06:24
  • @hpaulj The file is surely not corrupted. I am able to download the file with my browser, and when I do so, numpy can perfectly read the file. It seems that the problem is caused by python downloading the file. –  Feb 17 '19 at 09:13
  • @abolotnov See update. –  Feb 17 '19 at 09:14
  • What’s the content of the file that downloads? – abolotnov Feb 17 '19 at 15:56
  • @abolotnov It's a .npz file that contains a bunch of numpy arrays. See edit for the URL. –  Feb 18 '19 at 01:22
  • No, what’s inside the 1 KB file? I think you are not encoding the URL properly and are getting a 404 page in that download, or something similar. – abolotnov Feb 18 '19 at 01:29
  • @abolotnov Oh, I see what you mean. My bad. I just opened up the file and I'm seeing a lot of markup. One of the relevant bits is: `` –  Feb 18 '19 at 01:34
  • @abolotnov Thank you so much for the help! Checking the contents of the file should've been the first thing I did. I'm attempting to fix the problem, and I think a similar question has already been asked: https://stackoverflow.com/questions/34417412/python-get-url-contents-when-page-requires-javascript-enabled However, it would be great if my task of simply downloading a file didn't require something like selenium. –  Feb 18 '19 at 01:37
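
As the comments above suggest, the quickest diagnosis is to look at what was actually saved. Here is a minimal sketch of such a check, using only the standard library. The helper name `describe_download` and the 200-byte preview size are illustrative choices, not anything from the original code.

```python
import zipfile


def describe_download(path, preview_bytes=200):
    """Return (looks_like_zip, preview) for a downloaded file.

    looks_like_zip is True when the file is a readable zip archive,
    which is what a .npz file should be. preview is the start of the
    file decoded as text, so an HTML error page is easy to spot.
    """
    with open(path, "rb") as f:
        head = f.read(preview_bytes)
    looks_like_zip = zipfile.is_zipfile(path)
    return looks_like_zip, head.decode("utf-8", errors="replace")


if __name__ == "__main__":
    import sys

    # Usage: python check_download.py x.npz
    ok, preview = describe_download(sys.argv[1])
    print("zip archive:", ok)
    print(preview)
```

If the first value is False and the preview shows markup, the server returned a web page instead of the file, and `np.load` was never going to read it.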

2 Answers


Please show the full code and the exact lines you use to download the file. Remember that you need to use

import requests

r = requests.get("direct_URL_of_your_file.npz").content
with open("local_file.npz", 'wb') as f:
    f.write(r)

Also make sure the URL is a direct download link.

Abhinay Pandey
  • With requests, I used the exact same code that you've written here. The URL is something like this: `http://example.com/dist/x.npz`. –  Feb 17 '19 at 07:14
  • Try it in the terminal, and instead of writing to a file, print the response to see what is returned. Sometimes a wrong URL returns "Error: 404". – Abhinay Pandey Feb 17 '19 at 07:17
  • I used both the terminal and ran a `.py` file with the `python` command, still didn't work. –  Feb 17 '19 at 07:25
  • Try adding the npz to GitHub, use the raw GitHub link, and share the result. This would narrow the discussion down to whether the problem is with the URL or with the npz file. – Abhinay Pandey Feb 17 '19 at 07:34
  • In the question, I've mentioned that when I download the file with a browser, I don't get any kind of error. I have no problem reading the browser-downloaded file. The issue is likely with python - python won't download the file properly. –  Feb 17 '19 at 07:38
  • Browsers handle redirects, while a Python module may not. Also, try printing the content in a terminal; that would show what is being downloaded. In a terminal, run `requests.get("url").content` and show what comes out. Or maybe sharing the URL would let us investigate ourselves (if it's not confidential). – Abhinay Pandey Feb 17 '19 at 07:56
  • The link requires JavaScript to work. Here is the response being sent by the website link : https://pastebin.com/qttBud8y – Abhinay Pandey Feb 18 '19 at 10:10
0

The issue was that the server requires JavaScript to run, as a security precaution. So when I sent the request, all I got back was HTML saying "This Site Requires Javascript to Work". I found out that there is a __test cookie that needs to be passed with the request.

This answer explains it fully. This video may also be helpful.
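
For illustration, here is a rough sketch of sending the `__test` cookie along with the request, using only `urllib.request`. It assumes the cookie value has already been obtained, for example by copying it from the browser's developer tools after loading the page there. The function names `build_request` and `download` and the placeholder value are hypothetical. The server may also tie the cookie to other request details or expire it, so treat this as a starting point, not a guaranteed fix.

```python
import urllib.request


def build_request(url, test_cookie):
    """Build a GET request that carries the __test cookie.

    test_cookie is assumed to be a value copied from a browser session.
    """
    return urllib.request.Request(
        url, headers={"Cookie": f"__test={test_cookie}"}
    )


def download(url, dest, test_cookie):
    """Fetch url with the __test cookie and write the body to dest."""
    req = build_request(url, test_cookie)
    with urllib.request.urlopen(req) as resp, open(dest, "wb") as out:
        out.write(resp.read())


if __name__ == "__main__":
    # Placeholder value: replace with the cookie seen in the browser.
    download("http://example.com/dist/x.npz", "x.npz", "VALUE_FROM_BROWSER")
```

After the download, it is worth confirming that the saved file is about 300 KB and that `np.load` can read it, since a stale cookie would bring back the same HTML page.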