Using Regex to get multiple data on single line by scraping stocks from yahoo

Question

import urllib
import re

stocks_symbols = ['aapl', 'spy', 'goog', 'nflx', 'msft']

for i in range(len(stocks_symbols)):
    htmlfile = urllib.urlopen("https://finance.yahoo.com/q?s=" + stocks_symbols[i])
    htmltext = htmlfile.read(htmlfile)
    regex = '<span id="yfs_l84_' + stocks_symbols[i] + '">(.+?)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern, htmltext)

    regex1 = '<h2 id="yui_3_9_1_9_(.^?))">(.+?)</h2>'
    pattern1 = re.compile(regex1)
    name1 = re.findall(pattern1, htmltext)
    print "Price of", stocks_symbols[i].upper(), name1, "is", price[0]

I guess the problem is in regex1:

regex1 = '<h2 id="yui_3_9_1_9_(.^?))">(.+?)</h2>'

I tried reading the documentation but was unable to figure it out.

In this program I am trying to scrape the stock name and stock price, given a list of stock symbols as input.

What I think I am doing is passing two (.+?) groups in one pattern, which seems incorrect.

Output:

Traceback (most recent call last):
  File "C:\Py\stock\stocks.py", line 14, in <module>
    pattern1 = re.compile(regex1)
  File "C:\canopy-1.4.0.1938.win-x86\lib\re.py", line 190, in compile
    return _compile(pattern, flags)
  File "C:\canopy-1.4.0.1938.win-x86\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
error: nothing to repeat

Sorry, we don't use regex in web scraping - we use `lxml`, `BeautifulSoup`, `PyQuery`, etc so we can answer for your question. — furas, Jul 05 '14 at 14:52
Maybe you should take a look at this: https://developer.yahoo.com/yql/console/ You can access Yahoo's stock information via an SQL-like API — wastl, Jul 05 '14 at 15:00
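I checked the page and I don't see an `<h2>` with `id="yui_3_9_1_9_`; there are only other tags with `id="yui_3_9_1_9_`. Maybe the `<h2>` elements are generated by JavaScript, in which case you may need something more than `urllib` or `requests`. — furas, Jul 05 '14 at 15:28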
I actually wanted to scrape the name from the symbol; for AAPL it should say 'Apple Inc. (AAPL)'. I got it by including all the HTML code above it: `regex1 = '(.+?)-NasdaqGS` — Noorsimar, Jul 05 '14 at 17:47

score 1 · Accepted Answer · answered Jul 05 '14 at 15:09

^ matches the start of a string, and a ? after it has nothing to repeat, so the pattern is not a legal regex. If you change your regex to regex1 = '<h2 id="yui_3_9_1_9_(.+?)">(.+?)</h2>' it should work. Note that you also had one closing parenthesis too many.
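As a rough illustration of the point above, here is a small sketch that checks the pattern from the question is rejected, while a version with `+` in place of `^` and balanced parentheses compiles. The sample HTML string and its id suffix are made up for the demonstration and are not Yahoo's real markup. It also shows that two (.+?) groups in one pattern are fine: re.findall then returns a list of tuples, one item per group.

```python
import re

# Made-up HTML for illustration only; this is not Yahoo's real markup.
sample_html = '<h2 id="yui_3_9_1_9_1404571234_42">Apple Inc. (AAPL)</h2>'

# Pattern from the question: "^?" asks to repeat an anchor and the
# parentheses are unbalanced, so re.compile rejects it.
broken = '<h2 id="yui_3_9_1_9_(.^?))">(.+?)</h2>'
try:
    re.compile(broken)
    broken_compiles = True
except re.error:
    broken_compiles = False

# Corrected pattern: "+" instead of "^" and one ")" fewer.
fixed = re.compile('<h2 id="yui_3_9_1_9_(.+?)">(.+?)</h2>')

# With two groups, findall returns one tuple per match.
matches = fixed.findall(sample_html)
for id_suffix, name in matches:
    print("id suffix:", id_suffix, "| name:", name)
```

If only the name is needed, making the first group non-capturing, `(?:.+?)`, would leave findall returning plain strings instead of tuples.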

Furthermore there is a better way to get yahoo's stock information. You can query a lot of tables (including stock info) with YQL and there is also a YQL-Console where you can try out your queries.

The result you get from there is JSON or XML, both of which Python libraries handle quite well.
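To give an idea of how little code a JSON result needs, here is a minimal sketch using the standard json module. The payload below is invented for illustration: its keys ("quotes", "symbol", "name", "price") are assumptions and do not reflect the actual schema returned by YQL.

```python
import json

# Invented payload for illustration only; the real response schema may differ.
payload = (
    '{"quotes": ['
    '{"symbol": "AAPL", "name": "Apple Inc.", "price": "94.03"}, '
    '{"symbol": "MSFT", "name": "Microsoft Corporation", "price": "41.80"}'
    ']}'
)

# json.loads turns the text into ordinary dicts and lists.
data = json.loads(payload)

# Name and price come out as dictionary lookups, with no regex involved.
lines = [
    "Price of {} {} is {}".format(q["symbol"], q["name"], q["price"])
    for q in data["quotes"]
]
for line in lines:
    print(line)
```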

score 1 · Answer 2 · Padraic Cunningham · 2014-07-05T15:27:33.657

You can extract the price using BeautifulSoup:

import requests
from bs4 import BeautifulSoup

stocks_symbols = ['aapl', 'spy', 'goog', 'nflx', 'msft']

for stock in stocks_symbols:
    htmlfile = requests.get("https://finance.yahoo.com/q?s={}".format(stock))
    # Name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(htmlfile.content, "html.parser")
    # The price sits in a <span> whose id ends with the lowercase symbol
    price = [x.text for x in soup.find_all("span", id="yfs_l84_{}".format(stock))]
    print("Price of {} is {}".format(stock.upper(), price[0]))

Output:

Price of AAPL is 94.03
Price of SPY is 198.20
Price of GOOG is 584.73
Price of NFLX is 472.35
Price of MSFT is 41.80

edited Jul 05 '14 at 15:27

answered Jul 05 '14 at 15:15


Nice :) I was working on example with lxml. – furas Jul 05 '14 at 15:26
@furas, thanks, I don't think the source has the other pattern as far as I can see – Padraic Cunningham Jul 05 '14 at 15:29
BeautifulSoup has a lot of bugs, even in bs4. I wound up going back to lxml (shudder) when BS wouldn't parse an entire webpage for the dozenth time due to god-knows-what. – Alkanshel Nov 30 '14 at 23:02

score 0 · Answer 3 · answered Jul 05 '14 at 15:39

Example with requests, lxml and CSS selectors:

import requests
import lxml.html  # the cssselect() method also needs the cssselect package installed

stocks_symbols = ['aapl', 'spy', 'goog', 'nflx', 'msft']

for symbol in stocks_symbols:

    r = requests.get("https://finance.yahoo.com/q?s=" + symbol)
    html = lxml.html.fromstring(r.text)

    price = html.cssselect('span#yfs_l84_' + symbol)
    print('%s: %s' % (symbol.upper(), price[0].text))

    # there is no `h2` with an `id` starting with "yui_3_9_1_9_"
    # so I can't test this part of the code

    #names = html.cssselect('h2[id^="yui_3_9_1_9_"]')
    #for x in names:
    #    print('%s %s' % (x.text, x.get('id')[len('yui_3_9_1_9_'):]))

result:

AAPL: 94.03
SPY: 198.20
GOOG: 584.73
NFLX: 472.35
MSFT: 41.80

That's a neat piece of code but the OP was asking for a way to integrate regex. That cssselect stuff doesn't look like regex, just string concatenation — Alkanshel, Nov 30 '14 at 23:03
