I have the following code:

import requests
from bs4 import BeautifulSoup
import urllib.request
import urllib.parse
import re

market = 'INDU:IND'
quote_page = 'http://www.bloomberg.com/quote/' + market

page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find('h1', attrs={'class': 'name'})
name = name_box.text.strip()
print('Market: ' + name)

This code works and lets me get the market name from the URL. I'm trying to do something similar for this website. Here is my code:

market = 'BTC-GBP'
quote_page = 'https://uk.finance.yahoo.com/quote/' + market
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find('span', attrs={'class': 'Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)'})
name = name_box.text.strip()
print('Market: ' + name)

I'm not sure what to do. I want to retrieve the current rate, the amount it has increased or decreased by (as both a number and a percentage), and when the information was last updated. How do I do this? I don't mind if you use a different method from the one I used previously, as long as you explain it. If my code is inefficient or unpythonic, could you also tell me how to fix it? I'm pretty new to web scraping and these modules. Thanks!

2 Answers

You can use the API provided by Yahoo Finance directly. For reference, see this answer: Yahoo finance webservice API

You can use BeautifulSoup and, when searching for the desired data, use a regex to match the dynamic span class names generated by the site's backend script:

from bs4 import BeautifulSoup as soup
import requests
import re

data = requests.get('https://uk.finance.yahoo.com/quote/BTC-GBP').text
s = soup(data, 'lxml')
# Raw strings (r'...') keep the regex backslashes from being read as string escapes
d = [i.text for i in s.find_all('span', {'class':re.compile(r'Trsdu\(0\.\d+s\) Trsdu\(0\.\d+s\) Fw\(\w+\) Fz\(\d+px\) Mb\(-\d+px\) D\(\w+\)|Trsdu\(0\.\d+s\) Fw\(\d+\) Fz\(\d+px\) C\(\$data\w+\)')})]
date_published = re.findall(r'As of\s+\d+:\d+PM GMT\.|As of\s+\d+:\d+AM GMT\.', data)
final_results = dict(zip(['current', 'change', 'published'], d+date_published))

Output:

{'current': u'6,785.02', 'change': u'-202.99 (-2.90%)', 'published': u'As of  3:55PM GMT.'}
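
Since the question asks for the change as a number and a percentage, the strings above can be converted to floats. This is only a minimal sketch: `parse_quote` is a hypothetical helper name, and it assumes the values always look like `'6,785.02'` and `'-202.99 (-2.90%)'`, as in the output shown.

```python
import re

# Assumed format of the 'change' string: "<signed number> (<signed number>%)"
CHANGE_RE = re.compile(r'\s*([+-]?[\d,]*\.?\d+)\s*\(([+-]?[\d,]*\.?\d+)%\)')

def parse_quote(results):
    """Turn the scraped strings into floats (hypothetical helper)."""
    match = CHANGE_RE.match(results['change'])
    if match is None:
        raise ValueError('unexpected change format: %r' % results['change'])
    return {
        # Strip thousands separators before converting
        'current': float(results['current'].replace(',', '')),
        'change': float(match.group(1).replace(',', '')),
        'percent': float(match.group(2).replace(',', '')),
    }
```

Calling `parse_quote(final_results)` returns a new dictionary with numeric `current`, `change` and `percent` values; the `published` entry is left out because it is not a number.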

Edit: given the new URL, you need to change the span classname:

data = requests.get('https://uk.finance.yahoo.com/quote/AAPL?p=AAPL').text
pattern = re.compile(r'Trsdu\(0\.\d+s\) Trsdu\(0\.\d+s\) Fw\(b\) Fz\(\d+px\) Mb\(-\d+px\) D\(b\)|Trsdu\(0\.\d+s\) Fw\(\d+\) Fz\(\d+px\) C\(\$data\w+\)')
values = [i.text for i in soup(data, 'lxml').find_all('span', {'class': pattern})]
final_results = dict(zip(['current', 'change', 'published'], values + re.findall(r'At close:\s+\d:\d+PM EST', data)))

Output:

{'current': u'175.50', 'change': u'+3.00 (+1.74%)', 'published': u'At close:  4:00PM EST'}
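
The two timestamp patterns above each cover only one wording and one time zone, so `published` goes missing whenever the page uses a different one. A single, looser pattern may be easier to maintain. This is a sketch under assumptions: `find_published` is a hypothetical helper, and it assumes the notice always starts with "As of" or "At close:", followed by a 12-hour time and an upper-case time-zone abbreviation. It will not match if the site inserts markup between those parts.

```python
import re

# Assumed wording: "As of" or "At close:", a 12-hour time, then a zone such as GMT or EST
NOTICE_RE = re.compile(r'(?:As of|At close:)\s+\d{1,2}:\d{2}(?:AM|PM)\s+[A-Z]{2,5}')

def find_published(html):
    """Return the first timestamp notice in the raw HTML, or None if absent."""
    match = NOTICE_RE.search(html)
    return match.group(0) if match else None
```

With this, `final_results['published'] = find_published(data)` sets the key to `None` when no notice is found, which is easier to check for than a key that is silently absent.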
Ajax1234
  • Thanks so much! I had to install lxml with pip but after that it worked. – RandomPerson1234554321 Feb 25 '18 at 21:08
  • Hey, when I change the end of the URL from BTC-GBP to AAPL?p=AAPL, the "As of" part isn't in the dictionary. Any ideas why? After further testing, some of them have the published "As of ..." entry and some don't. Could you tell me how to fix this? Thanks! – RandomPerson1234554321 Feb 25 '18 at 21:12
  • @RandomPerson1234554321 the link with the new addition renders a page with different `span` class names. Please see my recent edit. – Ajax1234 Feb 26 '18 at 00:32
  • It's still printing out this: {'current': '178.890', 'change': '+3.390 (+1.932%)'}. The text on the website that I'm trying to get is the bit that says 'As of 11:57AM EST. Market open.'. The class = 'C($c-fuji-grey-j) D(b) Fz(12px) Fw(n) Mstart(0)--mobpsm Mt(6px)--mobpsm' and the ID is 'quote-market-notice'. – RandomPerson1234554321 Feb 26 '18 at 17:00