
I've been trying to download .txt files from a website using the Requests module. When I log in manually with my username and password, I can see the actual data in my browser:

Point Code  Issue Date  Trade Date  Region  Pricing Point   Low High    Average Volume  Deals   Delivery Start Date Delivery End Date
RMTNWW  2018-10-09  2018-10-08  Rocky Mountains Northwest Wyoming Pool  2.910   2.955   2.935   323 44  2018-10-09  2018-10-09
RMTOPAL 2018-10-09  2018-10-08  Rocky Mountains Opal    2.925   3.050   2.965   209 40  2018-10-09  2018-10-09

But when I try accessing the same page with my script and print the content with

print(page.content)

The output comes out as the html source:

   b'<!DOCTYPE html>\n<html>\n<head>\n\n<meta name="csrf-param" content="authenticity_token"/>\n<meta name="csrf-token" content="s35g4TAUN6+5V8Xi8x7u6f2FwziX3gbW9iY9D45HnEw="/>\n<meta http-equiv="content-type" content="text/html;charset=utf-8">
\n<meta name="description" content="Natural Gas Intelligence">\n<meta name="keywords" content="gas, natural gas, natural gas prices, enery prices, NYMEX, nymex settlement, aga, storage, natural gas data, henry hub, ferc, power, electricity, electric, megawatt, methane, reliability, inside, ngi">\n\n\n\n<meta content="false" name="has-log-view" />\n<!--<meta content="IE=EmulateIE7" http-equiv="X-UA-Compatible"/>
    .
    .
    .

None of the fields shown above (Point Code, Issue Date, etc.) appear anywhere in this HTML, so I suspect a login problem. The sign-in URL is https://www.naturalgasintel.com/user/login, whereas the file is located at https://www.naturalgasintel.com/ext/resources/Data-Feed/Daily-GPI/2018/10/20181009td.txt.

My script is:

import requests
with requests.Session() as c:
    data_url = 'https://naturalgasintel.com/ext/resources/Data-Feed/Daily-GPI/'
    username = ''
    password = ''
    login_data = dict(username=username, password=password)
    c.post(data_url, data=login_data, headers={'Referer':'https://www.naturalgasintel.com/'})
    page = c.get('https://www.naturalgasintel.com/ext/resources/Data-Feed/Daily-GPI/2018/10/20181009td.txt', stream=True)
    print(page.content)

I'd like to save the actual .txt contents of the page, not the HTML source, by writing them to a file with the open function, using something like:

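localfile = 'output_{}.csv'  # the {} placeholder still needs to be filled in, e.g. with .format()
# write() needs a string, so pass the decoded body rather than the Response object
with open(localfile, "w", encoding="utf-8") as datafile:
    datafile.write(page.text)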
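
Since the script sets stream=True, the body could also be written to disk in chunks instead of all at once. The following is only a minimal sketch of that idea, assuming the response really is the text file and not the login page. The helper name save_chunks and the output file names are hypothetical, and the demonstration feeds in stand-in byte chunks rather than a live response.

```python
import os
import tempfile
from typing import Iterable


def save_chunks(chunks: Iterable[bytes], path: str) -> int:
    """Write byte chunks to `path` and return the number of bytes written."""
    written = 0
    with open(path, "wb") as fh:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                fh.write(chunk)
                written += len(chunk)
    return written


# With a real streamed response this would be called as (not run here):
#   save_chunks(page.iter_content(chunk_size=8192), "output_20181009.csv")

# Offline demonstration with stand-in chunks instead of a network response.
demo_path = os.path.join(tempfile.mkdtemp(), "output_demo.csv")
demo_bytes = save_chunks(
    [b"Point Code\tIssue Date\n", b"", b"RMTNWW\t2018-10-09\n"],
    demo_path,
)
```

Writing in binary mode avoids any re-encoding of the server's bytes. If the file on disk still starts with <!DOCTYPE html>, the download step is fine and the login step is the part to look at.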

How can I get these contents instead of the html source?

HelloToEarth
  • https://stackoverflow.com/questions/16694907/how-to-download-large-file-in-python-with-requests-py – Chris Dec 10 '18 at 16:35
  • I've tried this method as well, and it still writes the same HTML source to local_filename when called on the URL. Could the login be the problem? – HelloToEarth Dec 10 '18 at 18:53
  • You must log in first, and then you can use the returned cookies to fetch other content. You cannot expect to be logged in automatically. – Boy Dec 10 '18 at 21:57
  • How would I use cookies for the later requests? Would I first write something like `req1 = requests.post(url)` and then `req2 = requests.post(data_url, cookies=req1.cookies)`, with `url = 'https://www.naturalgasintel.com/user/login'` and `data_url = 'https://www.naturalgasintel.com/ext/resources/Data-Feed/Daily-GPI/2018/10/20181009td.txt'`? – HelloToEarth Dec 10 '18 at 22:09
  • First, check whether you logged in successfully. Second, since you set `stream=True`, you have to do as Chris said. – KC. Dec 11 '18 at 09:11
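
The advice in the comments (log in first, reuse the cookies, and check that the login worked) could be sketched roughly as below. This is an untested outline, not a confirmed solution for this site. The form field names username and password are taken from the question's script and are an assumption about what the login form expects. The authenticity_token field name comes from the csrf-param meta tag in the HTML output shown in the question, and whether the site requires it is also an assumption. The helper names extract_csrf_token, looks_like_html and fetch_data are hypothetical.

```python
import re
from typing import Optional

LOGIN_URL = "https://www.naturalgasintel.com/user/login"
DATA_URL = (
    "https://www.naturalgasintel.com/ext/resources/"
    "Data-Feed/Daily-GPI/2018/10/20181009td.txt"
)


def extract_csrf_token(html: str) -> Optional[str]:
    """Return the content of <meta name="csrf-token" ...>, or None if absent."""
    match = re.search(r'<meta\s+name="csrf-token"\s+content="([^"]*)"', html)
    return match.group(1) if match else None


def looks_like_html(body: str) -> bool:
    """Heuristic: a login page starts with a doctype or <html> tag."""
    return body.lstrip().lower().startswith(("<!doctype html", "<html"))


def fetch_data(session, username: str, password: str) -> str:
    """Log in on `session`, then fetch the data file with the same cookies."""
    login_page = session.get(LOGIN_URL)
    # ASSUMPTION: these field names match the site's login form.
    form = {"username": username, "password": password}
    token = extract_csrf_token(login_page.text)
    if token:
        # ASSUMPTION: the token is expected under the csrf-param name.
        form["authenticity_token"] = token
    # Post to the login URL, not to the data URL.
    session.post(LOGIN_URL, data=form, headers={"Referer": LOGIN_URL})
    page = session.get(DATA_URL)
    page.raise_for_status()
    if looks_like_html(page.text):
        raise RuntimeError("Got HTML back; the login probably did not succeed")
    return page.text


# Usage with the real library (not run here; needs valid credentials):
#   import requests
#   with requests.Session() as s:
#       text = fetch_data(s, "my-username", "my-password")
```

A requests.Session keeps the cookies from the login response and sends them on later requests, so there is no need to pass cookies= by hand. The actual field names can be read from the login form in the browser's developer tools.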

0 Answers