4

I'm trying to scrape the earnings for each company from SeekingAlpha using BeautifulSoup. However, it seems like the site is detecting that a web scraper is being used, as I get an "HTTP Error 403: Forbidden".

The page I'm attempting to scrape is: https://seekingalpha.com/symbol/AMAT/earnings

Does anyone know what can be done to bypass this?

SCB
user172839

3 Answers

6

You should try setting a User-Agent as one of the request headers. The value can be that of any known browser.

Example:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36
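
With the requests library, that might look like the following (a minimal sketch; the header value is just the Chrome string shown above):

import requests

# Present a regular browser's User-Agent; servers commonly return
# 403 for requests whose User-Agent identifies an automated client.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/63.0.3239.132 Safari/537.36'
}
r = requests.get('https://seekingalpha.com/symbol/AMAT/earnings', headers=headers)
print(r.status_code)  # expect 200 rather than 403 if the User-Agent was the problem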

Ilija
4

I was able to access the site's contents by using a proxy, found here:

https://free-proxy-list.net/

Then, passing the proxy to the requests module, you can scrape the site:

import requests
import re
from bs4 import BeautifulSoup as soup

# Route the request through the proxy so it originates from a different IP
r = requests.get('https://seekingalpha.com/symbol/AMAT/earnings',
                 proxies={'http': '50.207.31.221:80'}).text

# Pull the revenue figures straight out of the raw HTML
results = re.findall(r'Revenue of \$[a-zA-Z0-9.]+', r)

s = soup(r, 'lxml')
titles = list(map(lambda x: x.text, s.find_all('span', {'class': 'title-period'})))
epas = list(map(lambda x: x.text, s.find_all('span', {'class': 'eps'})))
# Beat/miss indicators (green/red spans); collected but not used below
deciding = list(map(lambda x: x.text, s.find_all('span', {'class': re.compile('green|red')})))

# One row per quarter (EPS appears twice here, matching the output below)
results = list(map(list, zip(titles, epas, results, epas)))

Output:

[[u'Q4: 11-16-17', u'EPS of $0.93 beat by $0.02', u'Revenue of $3.97B', u'EPS of $0.93 beat by $0.02'],
 [u'Q3: 08-17-17', u'EPS of $0.86 beat by $0.02', u'Revenue of $3.74B', u'EPS of $0.86 beat by $0.02'],
 [u'Q2: 05-18-17', u'EPS of $0.79 beat by $0.03', u'Revenue of $3.55B', u'EPS of $0.79 beat by $0.03'],
 [u'Q1: 02-15-17', u'EPS of $0.67 beat by $0.01', u'Revenue of $3.28B', u'EPS of $0.67 beat by $0.01'],
 [u'Q4: 11-17-16', u'EPS of $0.66 beat by $0.01', u'Revenue of $3.30B', u'EPS of $0.66 beat by $0.01'],
 [u'Q3: 08-18-16', u'EPS of $0.50 beat by $0.02', u'Revenue of $2.82B', u'EPS of $0.50 beat by $0.02'],
 [u'Q2: 05-19-16', u'EPS of $0.34 beat by $0.02', u'Revenue of $2.45B', u'EPS of $0.34 beat by $0.02'],
 [u'Q1: 02-18-16', u'EPS of $0.26 beat by $0.01', u'Revenue of $2.26B', u'EPS of $0.26 beat by $0.01'],
 [u'Q4: 11-12-15', u'EPS of $0.29  in-line ', u'Revenue of $2.37B', u'EPS of $0.29  in-line '],
 [u'Q3: 08-13-15', u'EPS of $0.33  in-line ', u'Revenue of $2.49B', u'EPS of $0.33  in-line '],
 [u'Q2: 05-14-15', u'EPS of $0.29 beat by $0.01', u'Revenue of $2.44B', u'EPS of $0.29 beat by $0.01'],
 [u'Q1: 02-11-15', u'EPS of $0.27  in-line ', u'Revenue of $2.36B', u'EPS of $0.27  in-line '],
 [u'Q4: 11-13-14', u'EPS of $0.27  in-line ', u'Revenue of $2.26B', u'EPS of $0.27  in-line '],
 [u'Q3: 08-14-14', u'EPS of $0.28 beat by $0.01', u'Revenue of $2.27B', u'EPS of $0.28 beat by $0.01'],
 [u'Q2: 05-15-14', u'EPS of $0.28  in-line ', u'Revenue of $2.35B', u'EPS of $0.28  in-line '],
 [u'Q1: 02-11-14', u'EPS of $0.23 beat by $0.01', u'Revenue of $2.19B', u'EPS of $0.23 beat by $0.01']]
Ajax1234
  • Thank you. The solution is quite elegant. I will just need to work out how to get the other information as well, such as the quarter dates, EPS, etc. on that page. – user172839 Feb 12 '18 at 22:25
  • @user172839 what other pieces of information are you looking for? – Ajax1234 Feb 12 '18 at 22:29
  • I just need all the column information within that table – user172839 Feb 12 '18 at 22:34
  • @user172839 the Quarters and EPA's? – Ajax1234 Feb 12 '18 at 22:35
  • For example, for the first row of that table I want "Q4: 11-16-17 EPS of $0.93 beat by $0.02 revenue of $3.97B (+203%) beat by $30.00M". Is there an easy way of getting them out individually? (Sorry, I'm new to Python.) My end result is that I want to scrape a large number of companies from that list so I can do analysis on the results – user172839 Feb 12 '18 at 22:38
  • Thanks. One last thing. Will there be any issue if I'm scraping the website, say, a few thousand times? Or do I possibly need to add delays between each scrape? – user172839 Feb 12 '18 at 23:08
  • @user172839 you should be fine, unless the site blocks your IP, in which case you would have to use a different proxy. Also, if this answer helped you, please accept it. Thank you! – Ajax1234 Feb 12 '18 at 23:10
  • Excellent. Thank you. Will give it a try later on. – user172839 Feb 12 '18 at 23:27
  • I've tried it and also successfully wrote to a csv file. Thank you! – user172839 Feb 13 '18 at 02:07
  • @user172839 glad to help! – Ajax1234 Feb 13 '18 at 02:13
  • I've come across a small issue. When some of the sub-lists are empty, the zip function returns an empty result. For example, the deciding list for https://seekingalpha.com/symbol/JP/earnings is empty, so if you include it in the zip call for the result, nothing gets returned – user172839 Feb 13 '18 at 02:59
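
Regarding the empty-list issue raised in the last comment: zip() stops at its shortest input, so a single empty list empties the whole result. Here is a minimal sketch of one workaround using itertools.zip_longest, which pads short inputs with a placeholder instead (use itertools.izip_longest on Python 2; the sample values below are illustrative):

from itertools import zip_longest

titles = ['Q4: 11-16-17', 'Q3: 08-17-17']
epas = ['EPS of $0.93 beat by $0.02', 'EPS of $0.86 beat by $0.02']
deciding = []  # empty for some tickers, e.g. JP

# zip() would yield nothing here; zip_longest pads the missing
# entries with the fillvalue so no rows are dropped.
rows = list(map(list, zip_longest(titles, epas, deciding, fillvalue='N/A')))
# [['Q4: 11-16-17', 'EPS of $0.93 beat by $0.02', 'N/A'],
#  ['Q3: 08-17-17', 'EPS of $0.86 beat by $0.02', 'N/A']]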
0

For anyone out there using PyQuery:

from pyquery import PyQuery as pq
import requests


# Extra keyword arguments such as proxies are passed through to the
# underlying HTTP opener (requests, when installed) when PyQuery fetches the URL
page = pq('https://seekingalpha.com/article/4151372-tesla-fools-media-model-s-model-x-demand',
          proxies={'http': '34.231.147.235:8080'})
print(page)
  • (Used proxy info from https://free-proxy-list.net/)
  • Make sure you are using the requests library and not urllib; don't try to load the page with urlopen.
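
If you would rather not depend on which opener PyQuery picks, you can also fetch the page with requests yourself and hand the HTML to PyQuery (a minimal sketch, reusing the proxy from the answer above):

import requests
from pyquery import PyQuery as pq

# Fetch with requests directly so the proxy (and any headers) are
# under your control, then parse the returned HTML with PyQuery.
html = requests.get('https://seekingalpha.com/symbol/AMAT/earnings',
                    proxies={'http': '50.207.31.221:80'}).text
page = pq(html)
print(page('title').text())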
reinaH