
I want to scrape the text of an `h3` element with the class shown in the attached photo.

I modified the code based on the posted recommendation:

import requests
import urllib

session = requests.session()
session.headers.update({
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
  'Accept': '*/*',
  'Accept-Language': 'de,en-US;q=0.7,en;q=0.3',
  'Content-Type': 'application/json',
  'Origin': 'https://auth.fool.com',
  'Connection': 'keep-alive',
})

response1 = session.get("https://www.fool.com/secure/login.aspx")
assert response1

response1.cookies
#<RequestsCookieJar[Cookie(version=0, name='_csrf', value='8PrzU3pSVQ12xoLeq2y7TuE1', port=None, port_specified=False, domain='auth.fool.com', domain_specified=False, domain_initial_dot=False, path='/usernamepassword/login', path_specified=True, secure=True, expires=1609597114, discard=False, comment=None, comment_url=None, rest={'HttpOnly': None}, rfc2109=False)]>

params = urllib.parse.parse_qs(response1.url)
params

payload = {
    "client_id": params["client"][0],
    "redirect_uri": "https://www.fool.com/premium/auth/callback/",
    "tenant": "fool",
    "response_type": "code",
    "scope": "openid email profile",
    "state": params["https://auth.fool.com/login?state"][0],
    "_intstate": "deprecated",
    "nonce": params["nonce"][0],
    "password": "XXX",
    "connection": "TMF-Reg-API",
    "username": "XXX",
}
formatted_payload = "{" + ",".join([f'"{key}":"{value}"' for key, value in payload.items()]) + "}"



url = "https://auth.fool.com/usernamepassword/login"
response2 = session.post(url, data=formatted_payload)

response2.cookies
#<RequestsCookieJar[]>

`response2.cookies` is empty, so it seems that the login fails.

seralouk
  • Do you get the same page downloaded by the requests lib as you get in the browser? Content may be dynamically produced by JS, for example. – ilov3 Dec 19 '20 at 22:50
  • It seems that the website needs login info. Have you tried passing the credentials to `requests`? – MendelG Dec 19 '20 at 23:17
  • `findAll("h3", {"class": "content-item-headline"})` – furas Dec 19 '20 at 23:24
  • first check if the page works without JavaScript and without login. `requests` and `BeautifulSoup` can't run JavaScript - it may need [Selenium](https://selenium-python.readthedocs.io/) to control a web browser, which can run JavaScript. And if you have to log in to access this element, then you have to use `requests` to log in. – furas Dec 19 '20 at 23:25
  • I have edited my post with more details – seralouk Dec 20 '20 at 08:46

1 Answer


I can only give you some partial advice, but you might be able to find the "last missing piece" yourself (I have no access to the premium content of your target page). It's correct that you need to log in first in order to get the content:

It's usually helpful to use a session that handles cookies. A proper set of headers also often does the trick:

import requests
import urllib

session = requests.session()
session.headers.update({
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0',
  'Accept': '*/*',
  'Accept-Language': 'de,en-US;q=0.7,en;q=0.3',
  'Content-Type': 'application/json',
  'Origin': 'https://auth.fool.com',
  'Connection': 'keep-alive',
})

Next we get some cookies for our session from the "official" login page:

response = session.get("https://www.fool.com/secure/login.aspx")
assert response

We will use some of the parameters from the response URL (yes, there are a couple of redirects) to build a valid payload for the actual login:

params = urllib.parse.parse_qs(response.url)
params
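
A note on this step: `parse_qs` is applied to the full redirect URL rather than just its query string, so the first parameter's key keeps the URL prefix attached. That is why the payload below looks up the odd-looking key `params["https://auth.fool.com/login?state"]`. A small sketch (with a hypothetical redirect URL of the same shape) shows the quirk and a cleaner alternative using `urlsplit`:

```python
import urllib.parse

# hypothetical redirect URL mimicking the shape of the real one
url = "https://auth.fool.com/login?state=abc123&client=xyz&nonce=n0nce"

# parse_qs on the full URL keeps the scheme/host/path glued to the first key
raw = urllib.parse.parse_qs(url)
print(list(raw))  # ['https://auth.fool.com/login?state', 'client', 'nonce']

# splitting off the query string first yields clean parameter names
query = urllib.parse.urlsplit(url).query
clean = urllib.parse.parse_qs(query)
print(clean["state"][0])  # abc123
```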

payload = {
    "client_id": params["client"][0],
    "redirect_uri": "https://www.fool.com/premium/auth/callback/",
    "tenant": "fool",
    "response_type": "code",
    "scope": "openid email profile",
    "state": params["https://auth.fool.com/login?state"][0],
    "_intstate": "deprecated",
    "nonce": params["nonce"][0],
    "password": "#pas$w0яδ",
    "connection": "TMF-Reg-API",
    "username": "seralouk@stackoverflow.com",
}
formatted_payload = "{" + ",".join([f'"{key}":"{value}"' for key, value in payload.items()]) + "}"
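
Concatenating the JSON string by hand works here, but it will break as soon as a value contains a quote or backslash. The standard library's `json.dumps` produces the same escaped JSON safely; a minimal sketch with made-up credentials:

```python
import json

# hypothetical credentials; note the embedded quote that would break
# naive string concatenation
payload = {"username": "user@example.com", "password": 'p"w'}

formatted_payload = json.dumps(payload)
print(formatted_payload)  # {"username": "user@example.com", "password": "p\"w"}
```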

Finally, we can login:

url = "https://auth.fool.com/usernamepassword/login"
response = session.post(url, data=formatted_payload)

Let me know if you are able to log in or if we need to tweak the script. Some general comments: I normally use an incognito tab to inspect the browser requests and then copy them over to Postman, where I play around with the parameters and see how they influence the HTTP response. I rarely use Selenium; instead I invest the time to build proper requests with Python's `requests` library and then use BeautifulSoup.

Edit: After logging in, you can use BeautifulSoup to parse the content of the actual site:

# add BeautifulSoup to our project
from bs4 import BeautifulSoup

# use the session with the login cookies to fetch the data
the_url = "https://www.fool.com/premium/stock-advisor/coverage/tags/buy-recommendation"
data = BeautifulSoup(session.get(the_url).text, 'html.parser')
my_h3 = data.find("h3", "content-item-headline")
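
If the `h3` is present in the server-rendered HTML (i.e. not injected by JavaScript), its text can then be read with `get_text`. A self-contained sketch with stand-in markup mimicking the page's structure:

```python
from bs4 import BeautifulSoup

# stand-in markup; the real page would come from session.get(...).text
html = '<div><h3 class="content-item-headline"> Some headline </h3></div>'

soup = BeautifulSoup(html, "html.parser")
h3 = soup.find("h3", "content-item-headline")
print(h3.get_text(strip=True))  # Some headline
```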
Gregor
  • thanks for the reply. How can I ask you more in private? – seralouk Dec 22 '20 at 08:15
  • the code works perfectly but ultimately I want to scrape the text data of this website `https://www.fool.com/premium/stock-advisor/coverage/tags/buy-recommendation`. – seralouk Dec 22 '20 at 08:23
  • I tried to integrate [this](https://stackoverflow.com/a/57278079/5025009) but with no luck – seralouk Dec 22 '20 at 08:41
  • I'd rather discuss your project here so everybody with a similar problem can benefit from our conversation. What I meant with my "partial" advice was that I cannot enter the premium content. But I think you should be able to execute the code above to login and then use the session to get the content, i.e. `response = session.get(the_url)`, and then parse `response.text` with BeautifulSoup. – Gregor Dec 22 '20 at 19:00
  • I have discovered that the login fails. I have modified my initial post. – seralouk Dec 23 '20 at 14:25