
The question is about automating the download of data from an authenticated Django website onto a headless Linux server. Being able to do it with a Python script would be great.

Background of the question

The Challenge Data website is a site hosting data science challenges. This website is written in Django.

The 18-Owkin data challenge provides fairly large files (over 10 GB). You need to be authenticated in order to download the data.

I'm able to download the files from my Windows 10 laptop after authenticating on the website and clicking a download link such as y_train; the download then starts automatically.

However, I want to get the data onto a GPU cloud Linux machine (headless, no GUI). I can relay the files from my laptop, but that is very slow because my upload bandwidth is low.

Do you see a way to get the data directly from the Linux core server? That would mean:

  • Authenticate to the (Django) website.
  • Then connect to the "download URL". Will it perform the download?
  • You can use the Requests library to perform the authentication, then, depending on how the authentication is done, just send a request to the website to get the file with the authentication headers, or use the authenticated session to get the data. To do so, I would open my browser's Inspect function and watch the Network section when I attempt the authentication/download. I would then construct a Request to do the authentication for me and download the file(s) to a desired folder on my Linux box. – BoboDarph Jun 21 '19 at 13:16
  • @BoboDarph Thanks for your answer. I was indeed looking at the [Login to webpage from script using Requests and Django](https://stackoverflow.com/questions/24562590/login-to-webpage-from-script-using-requests-and-django) post on how to use the Requests library. I then need to figure out how to get the site's response and save the output to a file. – mathcounterexamples.net Jun 21 '19 at 13:25
  • Django doesn't matter here, it's just a normal GET request to https://challengedata.ens.fr/participants/challenges/18/download/y-test with 3 cookies which you need to get from the authentication part: csrftoken, BSI-CS, and sessionid. PS: Remember to logout before you end your script, otherwise your next login might fail. – BoboDarph Jun 21 '19 at 13:27
  • To get the large file, all you need to do is chunk it down and write it as it comes along: https://stackoverflow.com/questions/16694907/download-large-file-in-python-with-requests – BoboDarph Jun 21 '19 at 13:33
  • @BoboDarph Thanks! With all your advice, I'm almost there. – mathcounterexamples.net Jun 21 '19 at 14:33

1 Answer


With the great help of BoboDarph, I arrived at the following working Python script:

# python script
import requests
from getpass import getpass
from os import stat

# constants: login/logout endpoints of the Django site
url_login = 'https://challengedata.ens.fr/login/'
url_logout = 'https://challengedata.ens.fr/userlogout'

username = input("Challenge data Username: ")
password = getpass("Challenge data Password: ")

# use a session so cookies (csrftoken, sessionid) persist across requests
client = requests.session()
client.get(url_login)  # fetch the login page to obtain the CSRF cookie

csrftoken = client.cookies['csrftoken']

# log in; the 'next' parameter redirects straight to the download URL,
# so the body of the POST response is the file itself
login_data = {'username': username, 'password': password,
              'csrfmiddlewaretoken': csrftoken,
              'next': 'https://challengedata.ens.fr/participants/challenges/18/download/y-train'}
r = client.post(url_login, data=login_data)

# Django rotates the CSRF token after login; re-read it for the logout POST
csrftoken = client.cookies['csrftoken']

file_save = 'training_output.csv'

# write the response to disk in ~1 MB chunks
with open(file_save, 'wb') as fd:
    for chunk in r.iter_content(chunk_size=1045504):
        fd.write(chunk)

print("File '{0}' saved, {1} bytes".format(file_save, stat(file_save).st_size))

# log out so the next scripted login does not fail
r = client.post(url_logout, data={'csrfmiddlewaretoken': csrftoken})