I am new to web scraping and would like to learn how to do it properly and politely. My problem is similar to [this][1].

'So I am trying to log into and navigate to a page using python and requests. I'm pretty sure I am getting logged in, but once I try to navigate to a page the HTML I print from that page states you must be logged in to see this page.'

I've checked robots.txt of the website I would like to scrape. Is there something which prevents me from scraping?

User-agent: *
Disallow: /caching/
Disallow: /admin3003/
Disallow: /admin5573/
Disallow: /members/
Disallow: /pp/
Disallow: /subdomains/
Disallow: /tags/
Disallow: /templates/
Disallow: /bin/
Disallow: /emails/
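A minimal sketch of checking these rules programmatically with the standard library's `urllib.robotparser`, in case it helps. The `example.com` URLs and the `/news/123` and `/members/profile` paths are made up for illustration and are not taken from the real site.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in the question, one directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /caching/
Disallow: /admin3003/
Disallow: /admin5573/
Disallow: /members/
Disallow: /pp/
Disallow: /subdomains/
Disallow: /tags/
Disallow: /templates/
Disallow: /bin/
Disallow: /emails/
"""


def is_allowed(url, user_agent="*"):
    """Return True if the rules in ROBOTS_TXT permit fetching `url`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs, for illustration only.
print(is_allowed("https://example.com/news/123"))
print(is_allowed("https://example.com/members/profile"))
```

With these rules the first call should report the path as allowed and the second as disallowed, since only paths under the listed prefixes are blocked. This only covers robots.txt; a site's terms of use may add further restrictions.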

My code, using the solution from the link above, which does not work for me:

import requests
from bs4 import BeautifulSoup

login_page = "<login url>"
link = "<required url>"

# Form fields sent with the login request
payload = {
    "username": "<some username>",
    "password": "<some password>",
}

p = requests.post(login_page, data=payload)       
cookies = p.cookies
page_response = requests.get(link, cookies=cookies)
page_content = BeautifulSoup(page_response.content, "html.parser")

p.cookies shows a RequestsCookieJar containing the cookie ASP.NET_SessionId=1adqylnfxbqf5n45p0ooy345 for WEBSITE.

Output of p.status_code: 200

UPDATE:

Using a session (s = requests.session()) doesn't solve my problem. I had tried this before I started looking into cookies.

Update 2: I am trying to collect news from a particular website. First, I filtered the news by a search word and saved the links that appeared on the first results page, using python requests + beautifulsoup. Now I would like to go through those links and extract the news text from each. The full text is only visible with credentials. There is no dedicated login page; it is possible to log in from any page. There is a login button, and when one moves the mouse over it a login window appears, as in the attached image. I tried logging in both via the main page and via the page from which I would like to extract the text (in separate trials, not at the same time). Neither works. I also tried to find a CSRF token by searching for "csrf_token", "authentication_token", "csrfmiddlewaretoken", "csrf" and "auth". Nothing was found in the HTML of the web pages.
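Searching for token names by hand can miss hidden form fields that have other names. A rough sketch of collecting every hidden input from the login form and merging it into the payload is below. It assumes the site sends such fields at all (ASP.NET WebForms pages often carry ones like `__VIEWSTATE`, but that is a guess for this site). The sample HTML and the helper names `HiddenInputCollector` and `build_login_payload` are invented for illustration, and the sketch uses the standard library's `html.parser` instead of BeautifulSoup only to stay dependency-free.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        # Attribute values can be None when the attribute has no value.
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_login_payload(login_html, username, password):
    """Merge the page's hidden fields with the credentials."""
    collector = HiddenInputCollector()
    collector.feed(login_html)
    payload = dict(collector.fields)
    # "username" and "password" are the field names used in the question;
    # the real form may name them differently.
    payload["username"] = username
    payload["password"] = password
    return payload


# Invented sample page, for illustration only.
SAMPLE_HTML = """
<form method="post" action="/login">
  <input type="hidden" name="__VIEWSTATE" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

print(build_login_payload(SAMPLE_HTML, "me", "secret"))
```

In practice the HTML would come from a GET of the login page made with the same session that later sends the POST, so that the hidden values and the session cookie belong together.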

Mirit
  • you need to use `session`, check here https://stackoverflow.com/questions/12737740/python-requests-and-persistent-sessions – Stack Nov 13 '18 at 16:40
  • Possible duplicate of [Python Requests and persistent sessions](https://stackoverflow.com/questions/12737740/python-requests-and-persistent-sessions) – Antwane Nov 13 '18 at 16:51

1 Answer

You can use requests.Session() to stay logged in, but you have to save the login cookies as a JSON file. The example below shows scraping code that logs in to Facebook with Selenium and saves the session cookies in JSON format:

import selenium
import mechanicalsoup
import json
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
import requests
import time

s = requests.Session()

# input() is the Python 3 replacement for raw_input()
email = input("Enter your facebook login username/email: ")
password = input("Enter your facebook password: ")

def get_driver():
    # Note: Selenium 4 deprecates executable_path in favour of a Service object.
    driver = webdriver.Chrome(executable_path='your_path_to_chrome_driver')
    driver.wait = WebDriverWait(driver, 3)
    return driver

def get_url_cookie(driver):
    # Log in through the browser, then dump the resulting cookies to disk.
    driver.get('https://facebook.com')
    driver.find_element(By.NAME, 'email').send_keys(email)
    driver.find_element(By.NAME, 'pass').send_keys(password)
    driver.find_element(By.ID, 'loginbutton').click()
    cookies_list = driver.get_cookies()
    with open('facebook_cookie.json', 'w') as script:
        json.dump(cookies_list, script)

driver = get_driver()
get_url_cookie(driver)

The code above gets the login session cookies using driver.get_cookies() and saves them as a JSON file. To use the cookies, load them into the session:

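with open('facebook_cookie.json') as c:
    load = json.load(c)
# Copy each saved cookie into the requests session
for cookie in load:
    s.cookies.set(cookie['name'], cookie['value'])
# The URL needs an explicit scheme, otherwise requests rejects it
url = 'https://facebook.com/the_url_you_want_to_visit_on_facebook'
browser = mechanicalsoup.StatefulBrowser(session=s)
browser.open(url)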

and your session is loaded.
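Saved cookies can expire between runs, so it may be worth filtering them when loading. The sketch below is one possible way to do that, assuming the cookie dicts have the shape returned by driver.get_cookies() (with an optional `expiry` timestamp in seconds). The helper names `save_cookies` and `load_live_cookies` and the sample cookies are made up for illustration.

```python
import json
import os
import tempfile
import time


def save_cookies(cookies_list, path):
    """Write a list of cookie dicts to a JSON file."""
    with open(path, "w") as fh:
        json.dump(cookies_list, fh)


def load_live_cookies(path, now=None):
    """Return {name: value} for saved cookies that have not expired."""
    now = time.time() if now is None else now
    with open(path) as fh:
        cookies_list = json.load(fh)
    live = {}
    for cookie in cookies_list:
        expiry = cookie.get("expiry")  # absent for session cookies
        if expiry is None or expiry > now:
            live[cookie["name"]] = cookie["value"]
    return live


# Made-up cookies in the dict shape returned by driver.get_cookies().
sample = [
    {"name": "session", "value": "abc", "domain": ".example.com", "path": "/"},
    {"name": "old", "value": "xyz", "domain": ".example.com", "path": "/", "expiry": 1},
]
path = os.path.join(tempfile.mkdtemp(), "cookies.json")
save_cookies(sample, path)
print(load_live_cookies(path))
```

The resulting dict can then be handed to the session with s.cookies.update(...). If it comes back empty, the login step probably needs to be run again.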

DevTotti