
I am trying to monitor a page for updates. However, I need to keep the same session and cookies, so I can't just send a whole new request each time.

How can I check for updates in the HTML within my current request? The page won't simply be updated in place; I will be redirected, but the URL remains the same.

Here is my current code:

import requests

url = 'xxx'

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}

# requests.get() takes no 'config' argument; keep-alive / cookie persistence needs a Session
response = requests.get(url, headers=headers, allow_redirects=True)


def get_status():
    html = response.text  # this should be the current HTML, not the HTML when I made the initial request
    if x in html:  # x is a keyword that only appears on the redirected page
        status = "exists"
    else:
        status = "null"
    return status


print(get_status())

EDIT: I will be using a while loop to run this function every 5 seconds to check whether the status is "exists".
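Roughly, the loop I have in mind looks like this (just a sketch, using the get_status() function above):

import time

while True:
    if get_status() == "exists":
        print("Keyword found - page has updated")
        break
    time.sleep(5)  # check again in 5 seconds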

EDIT2: I tried to implement it via requests_html but I am not getting as many cookies as I should be:

from requests_html import HTMLSession

session = HTMLSession()
session.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'})
r = session.get('x')
r.html.render(reload=False)
print(r.cookies.get_dict())
  • `if x in html` what is `x`? – Bubble Bubble Bubble Gut Apr 17 '18 at 13:33
  • @Ding I'm trying to search for changes in the HTML based on a keyword that I know is on the redirected page but isn't on the initial one. I was initially going to check for changes in URL but the URL remains the same. – J P Apr 17 '18 at 13:34
  • What do you want to do? Because it seems unclear to me what you're asking. You want to compare 2 HTML copies? – IMCoins Apr 17 '18 at 13:36
  • @IMCoins I want to visit a website using python requests, wait until that page updates and then print that it has changed. So I would use a while loop with my code above to keep checking every 5 seconds to see if the status is "exists". The reason I want to check HTML every 5 seconds is because I know that the initial page doesn't have that HTML, but the second one does. I can't just check for changes in the URL since it remains the same. – J P Apr 17 '18 at 13:38
  • @IMCoins And I can't just send a new request every 5 seconds because the site will give me cookies etc and I want the session to remain the same. – J P Apr 17 '18 at 13:40
  • You want to send a request every 5 seconds, but you can't send a request every 5 seconds? You might want to use [Selenium](https://www.seleniumhq.org/) instead of requests. I suggest you scrape the webpage, refresh it every now and then, and scrape it again. – IMCoins Apr 17 '18 at 13:42
  • @IMCoins I think I'm searching for functionality that doesn't exist. My problem is that the change in HTML I'm looking for is prompted by me receiving a cookie. The way the cookie is generated is if I'm on the website for a certain period of time (e.g. 1 hour), so that is why I would like to be able to keep a request open (as if I'm actually on the website), and check for those changes. Selenium isn't lightweight enough unfortunately. – J P Apr 18 '18 at 17:27

1 Answer


> However, I need to keep the same session and cookies so I can't just send a whole new request.

What you want to do here is open a session using

s = requests.Session()
response = s.get("http://www.google.com")

This will persist cookies and certain other settings across requests. See the documentation on Sessions for further details.

As you simply want to check whether the returned HTML is exactly the same as in the previous request, save the first response.text outside your function and check whether each new response.text equals the one saved earlier.
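Something like this (a rough sketch; the URL and the 5-second interval are placeholders taken from your question):

import time
import requests

url = 'xxx'  # placeholder from the question
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})

baseline = session.get(url).text  # HTML from the first request

while True:
    time.sleep(5)
    current = session.get(url).text  # re-request with the same session, so cookies persist
    if current != baseline:
        print("Page has changed")
        break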

If the website displays any content dynamically, this will of course not do the trick, but if you can check for a specific element in the DOM and compare it to the one from the previous request, this will work just fine.
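For example, you could parse out one element with BeautifulSoup (not something your code uses yet, so treat the '#status' selector as a made-up placeholder) and compare only that:

from bs4 import BeautifulSoup
import requests

url = 'xxx'  # placeholder from the question
session = requests.Session()

def get_element_text(html, selector='#status'):
    # '#status' is a made-up selector; replace it with the element you care about
    element = BeautifulSoup(html, 'html.parser').select_one(selector)
    return element.get_text(strip=True) if element else None

baseline = get_element_text(session.get(url).text)
# ... later, re-request with the same session ...
current = get_element_text(session.get(url).text)
if current != baseline:
    print("The watched element has changed")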

lobetdenherrn
  • 303
  • 1
  • 8
  • So I loop over that every 5 seconds (i.e. a new requests.Session()) and I just transfer the cookies from the previous session to the new one. Also, I check the html of every new session to see if the thing I am looking for exists? I'm not sure exactly how the website works, but it gives you a cookie after a certain period of time like 1 hour, and that cookie will then change the HTML which is what I am detecting. If I'm sending a new request every 5 seconds rather than staying on the site, will that not prevent them from giving me that cookie? – J P Apr 18 '18 at 17:23
  • I think maybe I misinterpreted something, so do I only do s = requests.Session() once, and then every 5 seconds I do s.get(x) and response.text? – J P Apr 18 '18 at 17:28
  • The latter comment is correct. As I mentioned above, the session is used to "persist cookies and certain other things across requests." Therefore you open a Session in your main code and then in the loop use this session object to execute your HTTP calls. Don't forget to close your session afterwards if the loop is not an infinite loop. – lobetdenherrn Apr 19 '18 at 11:07
  • Thanks. I tried to implement this into my code, but the number of cookies being returned is about 1/8 of what I get when I visit the website in my browser. Any ideas why? (I added my code to the post). – J P Apr 20 '18 at 20:49
  • This can have many different causes. One would be linked to the question raised [here](https://stackoverflow.com/questions/27652543/how-to-use-python-requests-to-fake-a-browser-visit). Many cookies in your browser are set by client-side javascript code and therefore will not appear in your session object in python, as requests cannot execute javascript. If you would like to "simulate" a browser, you should probably use [Selenium](https://pypi.python.org/pypi/selenium). – lobetdenherrn Apr 22 '18 at 12:21
  • But requests_html can render JS? – J P Apr 22 '18 at 15:24
  • As sberry put it quite nicely [here](https://stackoverflow.com/questions/26393231/using-python-requests-with-javascript-pages?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa): "No, Requests is an http library. It cannot run javascript." Of course it can retrieve you the code that the site contains, but it will never execute that code. – lobetdenherrn Apr 24 '18 at 13:30