
I am trying to access the page https://seekingalpha.com/api/v3/symbols/hsy/press-releases using Python requests.

If I go to the page manually, open the devtools panel, and inspect the request https://seekingalpha.com/api/v3/news?filter[category]=market-news%3A%3Aall&page[size]=5, I can copy-paste the request headers, which contain the site's cookie. By setting those manually I am then able to reach the page using requests:

headers = {
    'accept': '*/*',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8,fr;q=0.7',
    'cache-control': 'no-cache',
    'cookie': 'machine_cookie=0427562897246; _pxvid=3a123ba7-3e5d-11eb-a61d-0242ac120017; prism_25946650=8900e9d7-b37b-4d84-9487-0201d2065590; _ga=GA1.2.747536612.1607985570; _gid=GA1.2.1375655082.1607985570; _gcl_au=1.1.785293143.1607985570; __tbc=%7Bjzx%7DlCp-P5kTOqotpgFeypItnMNfCqB03Jnfrv3KQZtXwF3ncfmaDQ98SBay3PXmCBnvPDoBVUmuXm8FouQ0JGElHQT-keiLYDix8RYL4SoUOxJjMub4h3TpZRVQ_edVMb61UVgFAp6l8Mpn6PJS7yCpyA; __pat=-18000000; __pvi=%7B%22id%22%3A%22v-2020-12-14-22-39-29-611-1jo5IsQ7wmgpITTS-3f23a5404637a2a40a85f9ea30050d82%22%2C%22domain%22%3A%22.seekingalpha.com%22%2C%22time%22%3A1607987120778%7D; xbc=%7Bjzx%7DTsLlDv3TXKwd1pAfboMsSnAV3s9R4OnJHOTGW34XfqHE9XguV0cuq-tg-wpJwicWtq5BbeakKu9-e2k9mudI9_nZX365XWEAIEiYbfoRgRsjmdC0GsUSh9_Z0HBjeiY1JY4_tnmYAU4S-z_H3LEmfMyTffbP-zyj1qTHoxeuH9Mm0Ce7LB5xgxX03a65iNmBWhmboGNXjyyWjs7SwY402e_Sk1_4O4l073jBmh9jRLU7AkV6QBL22p2g1qgC78KI12HAOLlDFhRc1QuLjNNzU7G1D3QVi6NamFxveoczdabIhqbAgRSqMRR8tMk-PavGOusVNURIs0m9avquxB0LjhuCIeBKg2K3IABSmyH1pFhZGath0E2HTTJ8ueb6Yj_0oQ8OBqx1YI9l4eFkdPJt06y1_boHQhYNOgCT6OewGdj-ZCEbP5w3D3aSfBbdCgXNgKh2Ys34RMi11ejU7r0TEOdd21h_kWrMpZw7qlE3_Xh9HhtaWjujLnCTpXPgAVId; _uetsid=3a9a15903e5d11ebb0c8cbc81eeed304; _uetvid=3a9a89703e5d11ebb9f4cd856d5d177f; _px2=eyJ1IjoiZDZhY2JmNDAtM2U2MC0xMWViLTljMGItMWQ0OGRjOWJlNDA2IiwidiI6IjNhMTIzYmE3LTNlNWQtMTFlYi1hNjFkLTAyNDJhYzEyMDAxNyIsInQiOjE2MDgwMDgwNTQyMTIsImgiOiIzYTQzMTU3MGJkMzE2ZmQ1YTVkZjM1ZjFmZWU3NTQxYjJiMTcxZmY5M2I0NDUyYzQ1YTFiZGNhOGFiYmRiMDFlIn0=; _px=+bUqf8l/WIbrt+qNCDX+18JknkuO9/05f6FMm402KUELBnVmyufZp2ExW6YDfg8Qu+eI3ae73PcqrVn+numnTQ==:1000:7m/qEw8v5Fh6e0zEdth41JR4ArTi5emjJZWnzK1p2ZznQQQpHdKInTpt8i272JpgAUaJ1jO25sNB4p72C5WOwNgCAyxzECTWG/Mws+llWhTXPmBNGMZFuHCc1P3YPOs4ffSGTx078fuE28EFuQIC3sDnhQum+tIxxwH5UHZkRwiGvL0whtVhUyFsfpdtwPabudbmriBXFvMDq8TOPZPpLzOKVzOzXDVrscLXMpEw14UisbsjBksCU4MhYyRmF03JH2lPI6SbTo8unDxeJhIKZg==; _pxde=3909397fae9c6c84b8595d0ca41405600414dec85ef42530c75d2d03d38258a8:eyJ0aW1lc3RhbXAiOjE2MDgwMDc1NTQyMTIsImZfa2IiOjB9; session_id=80ba4e65-7eda-4b97-b0c7-a1262e40ed4e',
    'pragma': 'no-cache',
    'referer': 'https://seekingalpha.com/symbol/HSY/press-releases',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-origin',
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'
}

url = 'https://seekingalpha.com/api/v3/symbols/hsy/press-releases'
requests.get(url, headers=headers)

However, that cookie expires after a few hours, and then I need to repeat the same manual process if I want to use the script again.

I was hoping it would be possible to emulate with requests the same path a human follows manually on the site, i.e. first visiting the gate page https://seekingalpha.com/, having the cookie set on that page, and then reaching the target page.

Something like

headers = {'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'}
session = requests.Session()
session.headers.update(headers)
session.get('https://seekingalpha.com/')
session.get('https://seekingalpha.com/api/v3/symbols/hsy/press-releases')

However, doing so I receive a 403 error. I have tried to inspect the Network panel in the devtools (refreshing with F5) to find which HTTP response contains the Set-Cookie header, but somehow I couldn't find it (I have used this approach in the past with some success).
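As an aside, requests.Session does persist cookies received from earlier responses and resends them on subsequent requests automatically; a minimal local illustration of the cookie jar (the cookie name and value below are made up, though in practice a call like session.get('https://seekingalpha.com/') would populate the jar from Set-Cookie headers):

```python
import requests

session = requests.Session()

# Simulate a Set-Cookie received from a gate page (made-up value);
# a real session.get() would populate the jar the same way.
session.cookies.set("machine_cookie", "example-value", domain="seekingalpha.com")

# The cookie jar persists across all requests made with this session.
print(session.cookies.get("machine_cookie"))  # example-value
```

So if the 403 persists even with a warm session, the missing piece is likely a cookie the gate page never sets for non-browser clients, rather than the session mechanics themselves.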

jim jarnac

2 Answers


This may or may not help, but you can use Postman's Interceptor functionality to capture the requests made by Chrome. That lets you see the actual requests being sent, which may shed some light on how to resolve the issue.

MarioXbrl

Your headers are wrong; try it like this:

import requests
import json

s = requests.Session()
url = "https://seekingalpha.com/api/v3/symbols/hsy/press-releases"
s.headers = {
    "accept": "application/json, text/plain, */*",
    "accept-language": "en-US,en;q=0.9",
    "cache-control": "no-cache",
    "pragma": "no-cache",
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "same-site",
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
}
r = s.get(url)
print(r.text)

Afterwards you can parse the response:

m = json.loads(r.text)
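If the endpoint returns JSON:API-style data, which is only an assumption based on the api/v3 path (the real field names may differ), the individual items could be pulled out like this:

```python
import json

# Hypothetical payload shaped like a JSON:API response; the actual
# structure of the press-releases endpoint may differ.
sample = '{"data": [{"id": "1", "attributes": {"title": "Example press release"}}]}'

m = json.loads(sample)
titles = [item["attributes"]["title"] for item in m["data"]]
print(titles)  # ['Example press release']
```

With a live response, `m = r.json()` is equivalent to `json.loads(r.text)` and saves a step.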