4

I'm trying to scrape the earnings for each company from SeekingAlpha using BeautifulSoup. However, it seems like the site is detecting that a web scraper is being used, as I get an "HTTP Error 403: Forbidden".

The page I'm attempting to scrape is: https://seekingalpha.com/symbol/AMAT/earnings

Does anyone know what can be done to bypass this?

SCB
user172839

3 Answers

6

You should try setting a User-Agent as one of the request headers. The value can be that of any known browser.

Example:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36
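
With the requests library, that might look like the following (a minimal sketch; the header value is just the Chrome string shown above):

import requests

# Present a regular browser's User-Agent; servers commonly return
# 403 for requests whose User-Agent identifies an automated client.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/63.0.3239.132 Safari/537.36'
}
r = requests.get('https://seekingalpha.com/symbol/AMAT/earnings', headers=headers)
print(r.status_code)  # expect 200 rather than 403 if the User-Agent was the problem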

Ilija
4

I was able to access the site's contents by using a proxy, found here:

https://free-proxy-list.net/

Then, passing the proxy to the requests module, you can scrape the site:

import requests
import re
from bs4 import BeautifulSoup as soup

# Route the request through the proxy so it originates from a different IP
r = requests.get('https://seekingalpha.com/symbol/AMAT/earnings',
                 proxies={'http': '50.207.31.221:80'}).text

# Pull the revenue figures straight out of the raw HTML
results = re.findall(r'Revenue of \$[a-zA-Z0-9.]+', r)

s = soup(r, 'lxml')
titles = list(map(lambda x: x.text, s.find_all('span', {'class': 'title-period'})))
epas = list(map(lambda x: x.text, s.find_all('span', {'class': 'eps'})))
# Beat/miss indicators (green/red spans); collected but not used below
deciding = list(map(lambda x: x.text, s.find_all('span', {'class': re.compile('green|red')})))

# One row per quarter (EPS appears twice here, matching the output below)
results = list(map(list, zip(titles, epas, results, epas)))

Output:

[[u'Q4: 11-16-17', u'EPS of $0.93 beat by $0.02', u'Revenue of $3.97B', u'EPS of $0.93 beat by $0.02'],
 [u'Q3: 08-17-17', u'EPS of $0.86 beat by $0.02', u'Revenue of $3.74B', u'EPS of $0.86 beat by $0.02'],
 [u'Q2: 05-18-17', u'EPS of $0.79 beat by $0.03', u'Revenue of $3.55B', u'EPS of $0.79 beat by $0.03'],
 [u'Q1: 02-15-17', u'EPS of $0.67 beat by $0.01', u'Revenue of $3.28B', u'EPS of $0.67 beat by $0.01'],
 [u'Q4: 11-17-16', u'EPS of $0.66 beat by $0.01', u'Revenue of $3.30B', u'EPS of $0.66 beat by $0.01'],
 [u'Q3: 08-18-16', u'EPS of $0.50 beat by $0.02', u'Revenue of $2.82B', u'EPS of $0.50 beat by $0.02'],
 [u'Q2: 05-19-16', u'EPS of $0.34 beat by $0.02', u'Revenue of $2.45B', u'EPS of $0.34 beat by $0.02'],
 [u'Q1: 02-18-16', u'EPS of $0.26 beat by $0.01', u'Revenue of $2.26B', u'EPS of $0.26 beat by $0.01'],
 [u'Q4: 11-12-15', u'EPS of $0.29  in-line ', u'Revenue of $2.37B', u'EPS of $0.29  in-line '],
 [u'Q3: 08-13-15', u'EPS of $0.33  in-line ', u'Revenue of $2.49B', u'EPS of $0.33  in-line '],
 [u'Q2: 05-14-15', u'EPS of $0.29 beat by $0.01', u'Revenue of $2.44B', u'EPS of $0.29 beat by $0.01'],
 [u'Q1: 02-11-15', u'EPS of $0.27  in-line ', u'Revenue of $2.36B', u'EPS of $0.27  in-line '],
 [u'Q4: 11-13-14', u'EPS of $0.27  in-line ', u'Revenue of $2.26B', u'EPS of $0.27  in-line '],
 [u'Q3: 08-14-14', u'EPS of $0.28 beat by $0.01', u'Revenue of $2.27B', u'EPS of $0.28 beat by $0.01'],
 [u'Q2: 05-15-14', u'EPS of $0.28  in-line ', u'Revenue of $2.35B', u'EPS of $0.28  in-line '],
 [u'Q1: 02-11-14', u'EPS of $0.23 beat by $0.01', u'Revenue of $2.19B', u'EPS of $0.23 beat by $0.01']]
Ajax1234
  • Thank you. The solution is quite elegant. I will just need to work out how to get the other information as well, such as the quarter dates, EPS, etc. on that page. – user172839 Feb 12 '18 at 22:25
  • @user172839 what other pieces of information are you looking for? – Ajax1234 Feb 12 '18 at 22:29
  • I just need all the column information within that table – user172839 Feb 12 '18 at 22:34
  • @user172839 the Quarters and EPA's? – Ajax1234 Feb 12 '18 at 22:35
  • For example, for the first row of that table I want "Q4: 11-16-17 EPS of $0.93 beat by $0.02 revenue of $3.97B (+203%) beat by $30.00M". Is there an easy way of getting them out individually? (Sorry, I'm new to Python.) My end result is that I want to scrape a large number of companies from that list so I can do analysis on the results – user172839 Feb 12 '18 at 22:38
  • Thanks. One last thing. Will there be any issue if I'm scraping the website, say, a few thousand times? Or do I possibly need to add delays between each scrape? – user172839 Feb 12 '18 at 23:08
  • @user172839 you should be fine, unless the site blocks your IP, in which case you would have to use a different proxy. Also, if this answer helped you, please accept it. Thank you! – Ajax1234 Feb 12 '18 at 23:10
  • Excellent. Thank you. Will give it a try later on. – user172839 Feb 12 '18 at 23:27
  • I've tried it and also successfully wrote to a csv file. Thank you! – user172839 Feb 13 '18 at 02:07
  • @user172839 glad to help! – Ajax1234 Feb 13 '18 at 02:13
  • I've come across a small issue. When some of the sub-lists are empty, the zip function returns an empty result. For example, the deciding list for https://seekingalpha.com/symbol/JP/earnings is empty, so if you include it in the zip call for the result, nothing gets returned – user172839 Feb 13 '18 at 02:59
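
Regarding the empty-list issue raised in the last comment: zip() stops at its shortest input, so a single empty list empties the whole result. Here is a minimal sketch of one workaround using itertools.zip_longest, which pads short inputs with a placeholder instead (use itertools.izip_longest on Python 2; the sample values below are illustrative):

from itertools import zip_longest

titles = ['Q4: 11-16-17', 'Q3: 08-17-17']
epas = ['EPS of $0.93 beat by $0.02', 'EPS of $0.86 beat by $0.02']
deciding = []  # empty for some tickers, e.g. JP

# zip() would yield nothing here; zip_longest pads the missing
# entries with the fillvalue so no rows are dropped.
rows = list(map(list, zip_longest(titles, epas, deciding, fillvalue='N/A')))
# [['Q4: 11-16-17', 'EPS of $0.93 beat by $0.02', 'N/A'],
#  ['Q3: 08-17-17', 'EPS of $0.86 beat by $0.02', 'N/A']]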
0

For anyone out there using PyQuery:

from pyquery import PyQuery as pq
import requests


# Extra keyword arguments such as proxies are passed through to the
# underlying HTTP opener (requests, when installed) when PyQuery fetches the URL
page = pq('https://seekingalpha.com/article/4151372-tesla-fools-media-model-s-model-x-demand',
          proxies={'http': '34.231.147.235:8080'})
print(page)
  • (Used proxy info from https://free-proxy-list.net/)
  • Make sure you are using the requests library and not urllib; don't try to load the page with urlopen.
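
If you would rather not depend on which opener PyQuery picks, you can also fetch the page with requests yourself and hand the HTML to PyQuery (a minimal sketch, reusing the proxy from the answer above):

import requests
from pyquery import PyQuery as pq

# Fetch with requests directly so the proxy (and any headers) are
# under your control, then parse the returned HTML with PyQuery.
html = requests.get('https://seekingalpha.com/symbol/AMAT/earnings',
                    proxies={'http': '50.207.31.221:80'}).text
page = pq(html)
print(page('title').text())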
reinaH