I am new to web scraping and would like to learn how to do it properly and politely. My problem is similar to [this][1].
'So I am trying to log into and navigate to a page using python and requests. I'm pretty sure I am getting logged in, but once I try to navigate to a page the HTML I print from that page states you must be logged in to see this page.'
I've checked robots.txt of the website I would like to scrape. Is there something which prevents me from scraping? User-agent: * Disallow: /caching/ Disallow: /admin3003/ Disallow: /admin5573/ Disallow: /members/ Disallow: /pp/ Disallow: /subdomains/ Disallow: /tags/ Disallow: /templates/ Disallow: /bin/ Disallow: /emails/
My code with the solution from the link above which does not work for me:
import requests
from bs4 import BeautifulSoup
login_page = <login url>
link = <required url>
payload = {
“username” = <some username>,
“password” = <some password>
}
p = requests.post(login_page, data=payload)
cookies = p.cookies
page_response = requests.get(link, cookies=cookies)
page_content = BeautifulSoup(page_response.content, "html.parser")
RequestsCookieJar shows Cookie ASP.NET_SessionId=1adqylnfxbqf5n45p0ooy345 for WEBSITE (with p.cookies command)
Output of p.status_code : 200
UPDATE:
s = requests.session()
doesn't solve my problem. I had tried this before I started looking into cookies.
Update 2: I am trying to collect news from a particular web site. First I filtered news with a search word and saved links appeared on the first page with python requests + beautifulsoup. Now I would like to go through the links and extract news from them. The full text is possible to see with credentials only. There is no special login window and it's possible to log in via any page. There is a login button and when one move a mouse to that a login window appears as in attached image. I tried to login in both via the main page and via the page from which I would like to extract a text (not at the same time, but in different trials). None of this works. I also tried to find csrf token by searching for “csrf_token”, “authentication_token”, “csrfmiddlewaretoken”, :csrf", "auth". Nothing was found in html on the web pages.Image