
Using Python, I am trying to scrape a table of stocks under $10 from U.S. News Money's "Stocks Under $10" page, and then add each ticker to a list (so that I can iterate through each stock). Currently, I have this code:

import requests
import bs4 as bs

resp = requests.get('https://money.usnews.com/investing/stocks/stocks-under-10')
soup = bs.BeautifulSoup(resp.text, "lxml")
table = soup.find('table', {'class': 'table stock full-row search-content'})
tickers = []
for row in table.findAll('tr')[1:]:  # skip the header row
    ticker = str(row.findAll('td')[0].text)
    tickers.append(ticker)

I keep getting the error:

Traceback (most recent call last):
  File "sandp.py", line 98, in <module>
    sandp(0)
  File "sandp.py", line 40, in sandp
    for row in table.findAll('tr')[1:]:
AttributeError: 'NoneType' object has no attribute 'findAll'
  • Could you show us what `table` looks like? Just to make sure you're actually getting a result. – Tomas Farias Jun 23 '18 at 14:11
  • @TomasFarias I added a `print table` line and the terminal displayed `None`. – Fidel_Willis Jun 23 '18 at 14:20
  • Alright, seems `soup.find('table', {'class': 'table stock full-row search-content'})` can't find a result. Are you sure that's the correct class of the table? Have you checked soup whether you actually are accessing the correct content? Maybe you'll have to pass some header to `requests.get`. – Tomas Farias Jun 23 '18 at 14:23
  • @TomasFarias Yes it is the correct table – Fidel_Willis Jun 23 '18 at 14:30
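As the comments above suggest, `soup.find` returns `None` when no tag matches, and any attribute access on that result raises exactly the `AttributeError` in the traceback. A minimal, self-contained sketch of the failure and the guard (the HTML snippet is made up for illustration):

```python
import bs4 as bs

# Made-up HTML snippet: the class below deliberately does not match the search.
html = '<table class="other"><tr><td>AAA</td></tr></table>'
soup = bs.BeautifulSoup(html, 'html.parser')

table = soup.find('table', {'class': 'table stock full-row search-content'})
print(table)  # None: no table with that class exists in the document

tickers = []
if table is not None:  # guard before calling .findAll on the result
    for row in table.findAll('tr')[1:]:
        tickers.append(str(row.findAll('td')[0].text))
```

Printing the intermediate result like this is a quick way to confirm whether the page you received actually contains the table you expect.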

1 Answer


The site is rendered dynamically with JavaScript, so the table is not present in the raw HTML that `requests` receives. You can use `selenium` to load the page in a real browser first:

from selenium import webdriver
from bs4 import BeautifulSoup as soup
import collections
import re

d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://money.usnews.com/investing/stocks/stocks-under-10')
while True:  # keep clicking "Load More" until every result is on the page
  try:
    d.find_element_by_link_text("Load More").click()
  except Exception:
    break
company = collections.namedtuple('company', ['name', 'abbreviation', 'description', 'stats'])
headers = [['a', {'class':'search-result-link'}], ['a', {'class':'text-muted'}], ['p', {'class':'text-small show-for-medium-up ellipsis'}], ['dl', {'class':'inline-dl'}], ['span', {'class':'stock-trend'}], ['div', {'class':'flex-row'}]]
final_data = [[getattr(i.find(a, b), 'text', None) for a, b in headers] for i in soup(d.page_source, 'html.parser').find_all('div', {'class':'search-result flex-row'})]
new_data = [[i[0], i[1], re.sub(r'\n+\s{2,}', '', i[2]), [re.findall(r'[\$\w\.%/]+', c) for c in i[3:]]] for i in final_data]
final_results = [i[:3]+[dict(zip(['Price', 'Daily Change', 'Percent Change'], filter(lambda x: re.findall(r'\d', x), i[-1][0])))] for i in new_data]
new_results = [company(*i) for i in final_results]
d.quit()  # close the browser when done

Output (first company):

company(name=u'Aileron Therapeutics Inc', abbreviation=u'ALRN', description=u'Aileron Therapeutics, Inc. is a clinical stage biopharmaceutical company, which focuses on developing and commercializing stapled peptides. Its ALRN-6924 product targets the tumor suppressor p53 for the treatment of a wide variety of cancers. It also offers the MDMX and MDM2. The company was founded by Gregory L. Verdine, Rosana Kapeller, Huw M. Nash, Joseph A. Yanchik III, and Loren David Walensky in June 2005 and is headquartered in Cambridge, MA.more\n', stats={'Daily Change': u'$0.02', 'Price': u'$6.04', 'Percent Change': u'0.33%'})
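To see what the final `dict(zip(...))` step produces, here is that stats-parsing stage in isolation, run on a sample token list of the kind `re.findall` yields for one stock (the sample values are invented):

```python
import re

# Sample token list for one stock's stats block (values invented for illustration).
tokens = ['USD', '$6.04', '$0.02', '0.33%']

# Keep only tokens that contain a digit, then label them in order.
numeric = filter(lambda x: re.findall(r'\d', x), tokens)
stats = dict(zip(['Price', 'Daily Change', 'Percent Change'], numeric))
print(stats)  # {'Price': '$6.04', 'Daily Change': '$0.02', 'Percent Change': '0.33%'}
```

Note that this relies on the page always listing price, daily change, and percent change in the same order; `zip` simply pairs labels with tokens positionally.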

Edit:

All abbreviations:

abbrevs = [i.abbreviation for i in new_results]

Output:

[u'ALRN', u'HAIR', u'ONCY', u'EAST', u'CERC', u'ENPH', u'CASI', u'AMBO', u'CWBR', u'TRXC', u'NIHD', u'LGCY', u'MRNS', u'RFIL', u'AUTO', u'NEPT', u'ARQL', u'ITUS', u'SRAX', u'APTO']
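Since the original goal was to iterate over each stock, the `company` namedtuples can also be keyed by ticker for direct lookup. A sketch with sample data standing in for `new_results` (the second entry is invented for illustration):

```python
import collections

company = collections.namedtuple('company', ['name', 'abbreviation', 'description', 'stats'])

# Sample results standing in for new_results from the scrape above;
# 'Example Co' / 'EXMP' is a made-up placeholder entry.
new_results = [
    company('Aileron Therapeutics Inc', 'ALRN', '...', {'Price': '$6.04'}),
    company('Example Co', 'EXMP', '...', {'Price': '$1.20'}),
]

by_ticker = {c.abbreviation: c for c in new_results}
print(by_ticker['ALRN'].name)  # Aileron Therapeutics Inc
```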
  • I am completely new to this, what is the benefit of using Selenium over BS? – Fidel_Willis Jun 23 '18 at 14:31
  • @Fidel_Willis When I attempted to access the site with simple `requests`, my request packet was blocked by the site, thus only returning a very small string with `html`. Therefore, calling `BeautifulSoup.find` for a table would return `None`. I assume that the reason you are receiving an `AttributeError` is because of this. The best solution is to use `selenium`, as it runs the necessary client side scripts on the webpage needed to validate the IP, update the `DOM`, etc. However, if your code is returning the full HTML of the page, you can still use my solution starting at line 14. – Ajax1234 Jun 23 '18 at 14:37
  • Ah I understand now. Thank you. Using your code I get the following error: `os.path.basename(self.path), self.start_error_message) selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home Exception AttributeError: "'Service' object has no attribute 'process'" in > ignored` Is this due to having the file in the wrong folder? – Fidel_Willis Jun 23 '18 at 14:44
  • @Fidel_Willis The path to the driver must be passed to the constructor of `Chrome`. Please see my recent edit. Note, however, that you do not have to use `Chrome`. If you wish to use `Firefox`, you can simply use `webdriver.Firefox` and download the firefox driver. – Ajax1234 Jun 23 '18 at 14:47
  • After downloading the drivers and such, I finally got it to the right PATH. But now the I get an error referring to permissions: `Message: 'selenium' executable may have wrong permissions`. How do I change these? – Fidel_Willis Jun 23 '18 at 15:15
  • @Fidel_Willis Strange, I have never run into that error before. Does the program crash when you pass the driver path to the class `Chrome`, or later on, when you attempt to `get` the webpage? Off the top of my head, you could try running `sudo python your_current_filename.py`. Also, see this link: https://stackoverflow.com/questions/47148872/webdrivers-executable-may-have-wrong-permissions-please-see-https-sites-goo – Ajax1234 Jun 23 '18 at 15:17
  • I read the linked article and changed the path to an executable so my code looks like this: `d = webdriver.Chrome(executable_path=r'/anaconda2/lib/python2.7/site-packages/selenium/webdriver/chrome/chromedriver.exe')` Yet I still get the PATH error. Is this because the chromedriver isn't in the correct folder? – Fidel_Willis Jun 23 '18 at 15:36
  • @Fidel_Willis Are you still receiving `wrong permissions` error? Note, however, that `'/anaconda2/lib/python2.7/site-packages/selenium/webdriver/chrome/chromedriver.exe'` does not look like the full path. Instead, locate the executable, and drag-and-drop the file into your command-line/terminal to get the full path. Then, copy the result in your terminal and pass it to the class. – Ajax1234 Jun 23 '18 at 15:41
  • Thank you so much for all your quick and great help. That did the trick! – Fidel_Willis Jun 23 '18 at 15:56
  • @Fidel_Willis Glad to help! – Ajax1234 Jun 23 '18 at 16:00
  • Oh sorry one more thing can I simplify this just to get the abbreviations and put them into a list? Sorry – Fidel_Willis Jun 23 '18 at 16:01
  • @Fidel_Willis No problem, please see my recent edit. – Ajax1234 Jun 23 '18 at 16:04
  • Thank you again so much! – Fidel_Willis Jun 23 '18 at 16:21