Python - How to retrieve certain text from a website

Question

I have the following code:

import requests
from bs4 import BeautifulSoup
import urllib.request
import urllib.parse
import re

market = 'INDU:IND'
quote_page = 'http://www.bloomberg.com/quote/' + market

page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find('h1', attrs={'class': 'name'})
name = name_box.text.strip()
print('Market: ' + name)

This code works and lets me get the market name from the URL. I'm trying to do the same with Yahoo Finance. Here is my code:

market = 'BTC-GBP'
quote_page = 'https://uk.finance.yahoo.com/quote/' + market
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find('span', attrs={'class': 'Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)'})
name = name_box.text.strip()
print('Market: ' + name)

I'm not sure what to do. I want to retrieve the current rate, the amount it has increased or decreased by (as both a number and a percentage), and the time the information was last updated. How do I do this? I don't mind if you use a different method from the one I used previously, as long as you explain it. If my code is inefficient or unpythonic, could you also tell me how to fix it? I'm pretty new to web scraping and these modules. Thanks!

The current rate. But I'd also like it to output the increase/decrease of the market as both a percentage and a number, which are both on the website, and also the time when the information was uploaded, which is on the website too. — RandomPerson1234554321, Feb 25 '18 at 15:38

score 0 · Answer 1 · answered Feb 25 '18 at 15:47

You can use the API provided by Yahoo Finance directly. For reference, see this answer: Yahoo finance webservice API

Apurva N. Saraogi

I know but I also want to apply this to other websites, and learn more about web scraping. Thanks though! – RandomPerson1234554321 Feb 25 '18 at 21:09

Ajax1234 · Accepted Answer · 2018-02-26T00:32:03.957

You can use BeautifulSoup and, when searching for the desired data, use a regex to match the dynamic span class names generated by the site's backend script:

from bs4 import BeautifulSoup as soup
import requests
import re

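# Fetch the raw HTML of the quote page.
data = requests.get('https://uk.finance.yahoo.com/quote/BTC-GBP').text
s = soup(data, 'lxml')
# Raw strings keep the regex backslashes intact; the two alternatives are meant to match the price span and the change span.
span_class = re.compile(r'Trsdu\(0\.\d+s\) Trsdu\(0\.\d+s\) Fw\(\w+\) Fz\(\d+px\) Mb\(-\d+px\) D\(\w+\)|Trsdu\(0\.\d+s\) Fw\(\d+\) Fz\(\d+px\) C\(\$data\w+\)')
d = [i.text for i in s.find_all('span', {'class': span_class})]
# The timestamp is searched for in the raw HTML text rather than in a specific tag.
date_published = re.findall(r'As of\s+\d+:\d+(?:AM|PM) GMT\.', data)
final_results = dict(zip(['current', 'change', 'published'], d + date_published))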

Output:

{'current': u'6,785.02', 'change': u'-202.99 (-2.90%)', 'published': u'As of  3:55PM GMT.'}
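
The question asks for the change as a number and as a percentage, while the dictionary holds both in one string. A minimal sketch of splitting it afterwards, assuming the strings keep the format shown in the output above; `parse_price` and `parse_change` are hypothetical helper names, not part of any library:

```python
import re


def parse_price(price):
    """Convert a string like '6,785.02' to a float (thousands separators removed)."""
    return float(price.replace(',', ''))


def parse_change(change):
    """Split a string like '-202.99 (-2.90%)' into (amount, percent) floats."""
    number = r'[+-]?[\d,]+(?:\.\d+)?'
    match = re.fullmatch(r'\s*(%s)\s*\((%s)%%\)\s*' % (number, number), change)
    if match is None:
        # Assumption: any other layout is treated as an error rather than guessed at.
        raise ValueError('unexpected change format: %r' % change)
    amount, percent = (float(g.replace(',', '')) for g in match.groups())
    return amount, percent


# Sample values copied from the output above.
results = {'current': '6,785.02', 'change': '-202.99 (-2.90%)'}
current = parse_price(results['current'])
amount, percent = parse_change(results['change'])
print(current, amount, percent)
```

Raising on an unexpected format is a design choice here: if the site changes its layout, a loud failure is easier to notice than a silently wrong number.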

Edit: given the new URL, you need to change the span classname:

data = requests.get('https://uk.finance.yahoo.com/quote/AAPL?p=AAPL').text
span_class = re.compile(r'Trsdu\(0\.\d+s\) Trsdu\(0\.\d+s\) Fw\(b\) Fz\(\d+px\) Mb\(-\d+px\) D\(b\)|Trsdu\(0\.\d+s\) Fw\(\d+\) Fz\(\d+px\) C\(\$data\w+\)')
spans = [i.text for i in soup(data, 'lxml').find_all('span', {'class': span_class})]
final_results = dict(zip(['current', 'change', 'published'], spans + re.findall(r'At close:\s+\d:\d+PM EST', data)))

Output:

{'current': u'175.50', 'change': u'+3.00 (+1.74%)', 'published': u'At close:  4:00PM EST'}
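
Both snippets hard-code the label and time zone of the timestamp (and the second also assumes a single-digit PM hour), so a page showing, say, 'As of 11:57AM EST' matches neither pattern. A more tolerant pattern could be sketched as follows, assuming the notice always has the shape "label, time with AM/PM, upper-case zone abbreviation"; `NOTICE` and `find_notice` are hypothetical names chosen for illustration:

```python
import re

# Assumption: the label is either 'As of' or 'At close:' and the zone is 2-5 capital letters.
NOTICE = re.compile(r'(As of|At close:)\s+(\d{1,2}:\d{2}(?:AM|PM))\s+([A-Z]{2,5})')


def find_notice(text):
    """Return the first timestamp notice in text as a dict, or None if there is none."""
    match = NOTICE.search(text)
    if match is None:
        return None
    label, time, zone = match.groups()
    return {'label': label, 'time': time, 'zone': zone}


# Sample strings taken from the outputs and comments in this thread.
for sample in ('As of  3:55PM GMT.', 'At close:  4:00PM EST', 'As of 11:57AM EST. Market open.'):
    print(find_notice(sample))
```

In the snippets above, `find_notice(data)` would take the place of the `re.findall(...)` call; because it returns `None` when nothing matches, the missing case can be checked explicitly instead of the key silently dropping out of the dictionary.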

Thanks so much! I had to install lxml with pip but after that it worked. — RandomPerson1234554321, Feb 25 '18 at 21:08
Hey, when I change the ending of the URL from BTC-GBP to AAPL?p=AAPL, the "As of" bit isn't in the dictionary. Any ideas why? After further testing, some of them have the published "As of ..." text and some don't. Could you tell me how to fix this? Thanks! — RandomPerson1234554321, Feb 25 '18 at 21:12
@RandomPerson1234554321 the link with the new addition renders a page with different `span` classnames. Please see my recent edit. — Ajax1234, Feb 26 '18 at 00:32
It's still printing out this: {'current': '178.890', 'change': '+3.390 (+1.932%)'}. The text on the website that I'm trying to get is the bit that says 'As of 11:57AM EST. Market open.'. The class = 'C($c-fuji-grey-j) D(b) Fz(12px) Fw(n) Mstart(0)--mobpsm Mt(6px)--mobpsm' and the ID is 'quote-market-notice'. — RandomPerson1234554321, Feb 26 '18 at 17:00
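
Since that element has an ID, looking it up by ID is likely to be more robust than matching its generated class names. With BeautifulSoup the lookup would be `soup.find(id='quote-market-notice')`. The sketch below shows the same idea using only the standard library's `html.parser`, run against a made-up HTML fragment. The markup, the `TextById` class and the `text_by_id` helper are assumptions for illustration, and the approach only works if the element is present in the HTML the server returns rather than being filled in later by JavaScript:

```python
from html.parser import HTMLParser


class TextById(HTMLParser):
    """Collect the text inside the first element whose id equals target_id."""

    # Void elements have no closing tag, so they must not change the nesting depth.
    VOID = {'br', 'img', 'hr', 'input', 'meta', 'link'}

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # greater than 0 while inside the target element
        self.done = False   # True once the target element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in self.VOID:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get('id') == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in self.VOID:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def text_by_id(html, target_id):
    """Return the stripped text of the element with the given id, or '' if absent."""
    parser = TextById(target_id)
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts).strip()


# Hypothetical fragment built from the id and text quoted in the comment above.
sample = ('<div id="quote-market-notice" class="C($c-fuji-grey-j) D(b) Fz(12px)">'
          '<span>As of 11:57AM EST. Market open.</span></div>')
notice = text_by_id(sample, 'quote-market-notice')
print(notice)
```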
