Issue in extracting Titanic training data from Kaggle using Jupyter Notebook

Question

I'm trying to download the Titanic training and test data from Kaggle in a Jupyter Notebook. My code snippet is below.

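import os
from requests import session

# Credentials are read from environment variables (set via a .env file).
payload = {
    'action': 'login',
    'username': os.environ.get("KAGGLE_USERNAME"),
    'password': os.environ.get("KAGGLE_PASSWORD")
}

url = "https://www.kaggle.com/c/3136/download/train.csv"

with session() as c:
    # Log in first so the session can keep the authentication cookie.
    c.post('https://www.kaggle.com/account/login', data=payload)
    # Then request the CSV file within the same session.
    response = c.get(url)
    print(response.text)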

After executing this, I get an HTML response instead of the training data. My Kaggle login credentials are configured correctly in a .env file. Did I do something wrong here?

What is the HTML response you receive? You probably just need to parse the response. You are making an HTTP request, so an HTML response is a natural thing to receive. – h0r53 Jun 14 '18 at 18:15
Something along the lines of the following: Kaggle: Your Home for Data Science – sbpkoundinya Jun 14 '18 at 18:28
Okay, that's understandable. Again, you issued an HTTP request and you received an HTTP response, which is completely normal. You need to parse out the data you are interested in. What exactly do you want to extract from the response? – h0r53 Jun 14 '18 at 18:32
Note that since you'll likely be parsing HTML in Python, you should look into libraries that assist with that type of thing. A few exist; personally I like https://pythonhosted.org/pyquery/ – h0r53 Jun 14 '18 at 18:33
My code snippet logs in to Kaggle using my credentials, visits the link to the Titanic training CSV data file and prints it. Does that make sense? Or am I missing something? If so, what exactly? – sbpkoundinya Jun 14 '18 at 18:47
If you maintain a session, which you are doing, you should be able to subsequently issue a request to the "link to Titanic training". If the data is the entire CSV, then that is what the response text will be after issuing that request. Using a session and first posting to the login page should authenticate you and persist your session cookie. – h0r53 Jun 14 '18 at 20:22
Basically, identify the URL of the "link to Titanic training csv" and issue a subsequent request to that URL after the login request is posted. – h0r53 Jun 14 '18 at 20:22
The link to the Titanic training CSV file is the one I set the url variable to in my code snippet. I did exactly that: a POST to log in and a GET to fetch the CSV file. Can someone please correct my code if required, or let me know how to get the correct result? – sbpkoundinya Jun 15 '18 at 01:59
Is there anything in the HTML response to indicate that your login was not successful? I'll need either more of the HTML response from Kaggle or my own account to assist further. – h0r53 Jun 15 '18 at 12:22
Please test and verify the provided answer so that this question can be marked as answered instead of forgotten. – h0r53 Jun 21 '18 at 14:18
Possible duplicate of [Download Kaggle Dataset by using Python](https://stackoverflow.com/questions/49386920/download-kaggle-dataset-by-using-python) – Minh Triet Nov 08 '18 at 18:04

score 3 · Accepted Answer · answered Jun 15 '18 at 13:02

The site you are interested in uses anti-forgery tokens to prevent things like cross-site request forgery (CSRF). Your login was not successful, which is why your script was not working. The anti-forgery tokens present an obstacle, but nothing we cannot overcome with the magic of Python. I made an account and can successfully pull down the CSV data you want with the following script. Note: I had to parse the anti-forgery token out of the login page, and my code to do so is a bit messy, but it works.

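import requests

# The token field is filled in after the login page has been fetched.
payload = {
    '__RequestVerificationToken': '',
    'username': 'OMITTED',
    'password': 'OMITTED',
    'rememberme': 'false'
}

loginURL = 'https://www.kaggle.com/account/login'
dataURL = "https://www.kaggle.com/c/3136/download/train.csv"

with requests.Session() as c:
    # Fetch the login page first; the anti-forgery token is embedded in its HTML.
    response = c.get(loginURL).text
    # Slice the token out of the page. The offsets depend on the page layout.
    AFToken = response[response.index('antiForgeryToken')+19:response.index('isAnonymous: ')-12]
    print("AntiForgeryToken={}".format(AFToken))
    payload['__RequestVerificationToken'] = AFToken
    # Post the credentials together with the token, in the same session.
    c.post(loginURL + "?isModal=true&returnUrl=/", data=payload)
    # The session now carries the login cookie, so the download should succeed.
    response = c.get(dataURL)
    print(response.text)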
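
The slice offsets in the script above are tied to the exact page layout at the time of writing. A slightly more defensive option is a regular expression. The sketch below is only an illustration: `extract_af_token` is a hypothetical helper name, not part of the script above, and it assumes the login page embeds the token as `antiForgeryToken: '<token>'`, which is what the `+19` offset implies. The sample string is made up for the demonstration.

```python
import re

# Assumed page format, inferred from the slice offsets in the script above:
#   antiForgeryToken: '<token>',
TOKEN_PATTERN = re.compile(r"antiForgeryToken:\s*'([^']+)'")


def extract_af_token(html):
    """Return the anti-forgery token found in the page text, or raise ValueError."""
    match = TOKEN_PATTERN.search(html)
    if match is None:
        raise ValueError("antiForgeryToken not found; the page layout may have changed")
    return match.group(1)


# Made-up sample that mimics the assumed layout of the login page.
sample = "Kaggle.State = {\n  antiForgeryToken: 'abc123',\n  isAnonymous: true\n};"
token = extract_af_token(sample)
```

In the script above, this would replace the slicing line with `AFToken = extract_af_token(response)`. A failure then surfaces as a clear error instead of a silently wrong token.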

answered Jun 15 '18 at 13:02

h0r53


Thanks a lot for your help. I'll try and let you know if it works for me. – sbpkoundinya Jun 16 '18 at 08:28
I tried this and got an HTML response again. Do I need to extract the body using BeautifulSoup? Please let me know the correct syntax. – sbpkoundinya Jun 22 '18 at 19:10
First log in through the site and download the dataset manually; the HTML response is probably an acknowledgement that you won't abuse the site. After you've accepted the disclaimer, this should work. – h0r53 Jun 22 '18 at 20:08
You're absolutely correct. After I accepted the disclaimer, the code worked fine. The data got displayed properly. Thank you very much for your assistance. – sbpkoundinya Jun 24 '18 at 17:12
I'm happy that I could help. I enjoy scripting things like this in Python. Please mark the answer as accepted and consider an up-vote if you think the answer is satisfactory. – h0r53 Jun 25 '18 at 12:39
I am getting an HTML response even after accepting the disclaimer. – Binod Mathews Dec 11 '18 at 09:51
@BinodMathews Is there any way to get the JSON? I am still getting an HTML response as well. – Manoj Yadav Apr 25 '20 at 16:13
@h0r53 Can you please help with this? I tried everything you mentioned and I am getting the AFToken as well, but when I do print(response.text) it still prints the HTML response. – Manoj Yadav Apr 25 '20 at 16:16
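
Several comments above report still getting HTML back. One way to narrow that down is to check what kind of body was returned before printing or parsing it. The following is a minimal sketch, not a fix for the login flow itself: `looks_like_html` and `rows_from_csv` are hypothetical helper names, the check is only a heuristic, and the sample strings are made up for illustration.

```python
import csv
import io


def looks_like_html(content_type, body):
    """Heuristic: True if the header or the start of the body indicates HTML."""
    if "text/html" in (content_type or "").lower():
        return True
    head = body.lstrip()[:100].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


def rows_from_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


# Made-up bodies standing in for the two kinds of response discussed above.
html_body = "<!DOCTYPE html><html><head><title>Kaggle: Your Home for Data Science</title></head></html>"
csv_body = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n"

html_detected = looks_like_html(None, html_body)
csv_detected = looks_like_html("text/csv", csv_body)
rows = rows_from_csv(csv_body)
```

With the script from the answer, the check would be roughly `looks_like_html(response.headers.get("Content-Type"), response.text)`. If it returns True, the login or the competition-rules acceptance most likely did not go through, and parsing the body as CSV would not help.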
