import urllib
import re

stocks_symbols = ['aapl', 'spy', 'goog', 'nflx', 'msft']

for i in range(len(stocks_symbols)):
    htmlfile = urllib.urlopen("https://finance.yahoo.com/q?s=" + stocks_symbols[i])
    htmltext = htmlfile.read(htmlfile)
    regex = '<span id="yfs_l84_' + stocks_symbols[i] + '">(.+?)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern, htmltext)

    regex1 = '<h2 id="yui_3_9_1_9_(.^?))">(.+?)</h2>'
    pattern1 = re.compile(regex1)
    name1 = re.findall(pattern1, htmltext)
    print "Price of", stocks_symbols[i].upper(), name1, "is", price[0]

I guess the problem is in regex1:

regex1 = '<h2 id="yui_3_9_1_9_(.^?))">(.+?)</h2>'

I tried reading the documentation but was unable to figure it out.

In this program I am trying to scrape the stock name and stock price, given a list of stock symbols as input.

What I think I am doing is passing two (.+?) groups in one variable (regex1), which seems incorrect.

Output:

Traceback (most recent call last):
  File "C:\Py\stock\stocks.py", line 14, in <module>
    pattern1 = re.compile(regex1)
  File "C:\canopy-1.4.0.1938.win-x86\lib\re.py", line 190, in compile
    return _compile(pattern, flags)
  File "C:\canopy-1.4.0.1938.win-x86\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
error: nothing to repeat 
Noorsimar
  • Sorry, we don't use regex in web scraping - we use `lxml`, `BeautifulSoup`, `PyQuery`, etc so we can answer for your question. – furas Jul 05 '14 at 14:52
  • Maybe you should take a look at this: https://developer.yahoo.com/yql/console/ You can access Yahoo's stock information via an SQL-like API. – wastl Jul 05 '14 at 15:00
  • I checked the page and I don't see an `h2` with `id="yui_3_9_1_9_` - there are only `…` – furas Jul 05 '14 at 15:28
  • I actually wanted to scrape the name from the symbol; for AAPL it should say 'Apple Inc. (AAPL)'. I got it by including all the HTML code above it: `regex1 = '

    (.+?)

    -NasdaqGS
    `
    – Noorsimar Jul 05 '14 at 17:47

3 Answers

`^` matches the start of a string, so a `?` quantifier right after it has nothing to repeat and is not legal regex. If you change your regex to `regex1 = '<h2 id="yui_3_9_1_9_(.+?)">(.+?)</h2>'` it should compile. Note that you also had one closing parenthesis too many.
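
As a rough offline sketch of the point above, the snippet below checks that the pattern from the question fails to compile while a corrected one works, and shows what `re.findall` returns when a pattern has two groups. The `sample` string and the id suffix in it are invented for illustration (the real page may not contain such an element), and `compiles` is just a hypothetical helper name.

```python
import re

# Pattern from the question: `^?` has nothing to repeat, and there is an extra `)`.
broken = '<h2 id="yui_3_9_1_9_(.^?))">(.+?)</h2>'
# Corrected pattern: two non-greedy capture groups in one regex are fine.
fixed = '<h2 id="yui_3_9_1_9_(.+?)">(.+?)</h2>'


def compiles(pattern):
    """Return True if `pattern` is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Hypothetical markup, made up for this example only.
sample = '<h2 id="yui_3_9_1_9_1404571234_42">Apple Inc. (AAPL)</h2>'

# With two groups, findall returns a list of (group1, group2) tuples.
matches = re.findall(fixed, sample)

print(compiles(broken))  # False
print(compiles(fixed))   # True
for id_suffix, name in matches:
    print(name)          # Apple Inc. (AAPL)
```

Because each match is a tuple, the stock name in the question's print statement would be the second item of the first match (for example `name1[0][1]`), not `name1` itself.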

Furthermore, there is a better way to get Yahoo's stock information. You can query a lot of tables (including stock info) with YQL, and there is also a YQL Console where you can try out your queries.

The result you get from there is JSON or XML, both of which can be handled quite well with Python libraries.

wastl

You can extract the price using BeautifulSoup:

import requests
from bs4 import BeautifulSoup
stocks_symbols = ['aapl', 'spy', 'goog', 'nflx', 'msft']

for stock in stocks_symbols:
    # fetch the quote page for this symbol
    htmlfile = requests.get("https://finance.yahoo.com/q?s={}".format(stock))
    # name the parser explicitly so bs4 does not have to guess one
    soup = BeautifulSoup(htmlfile.content, "html.parser")
    price = [x.text for x in soup.find_all("span", id="yfs_l84_{}".format(stock))]
    print("Price of {}  is {}".format(stock.upper(), price[0]))

Output:

Price of AAPL  is 94.03
Price of SPY  is 198.20
Price of GOOG  is 584.73
Price of NFLX  is 472.35
Price of MSFT  is 41.80
Padraic Cunningham

Example with requests, lxml, and CSS selectors:

import requests
import lxml.html  # html.cssselect() also needs the cssselect package installed

stocks_symbols = ['aapl', 'spy', 'goog', 'nflx', 'msft']

for symbol in stocks_symbols:

    r = requests.get("https://finance.yahoo.com/q?s=" + symbol)
    html = lxml.html.fromstring(r.text)

    # select the price element by id, e.g. span#yfs_l84_aapl
    price = html.cssselect('span#yfs_l84_' + symbol)
    print('%s: %s' % (symbol.upper(), price[0].text))

    # there is no `h2` with an `id` starting with "yui_3_9_1_9_"
    # so I can't test this part of the code

    #names = html.cssselect('h2[id^="yui_3_9_1_9_"]')
    #for x in names:
    #    print('%s %s' % (x.text, x.get('id')[len('yui_3_9_1_9_'):]))

result:

AAPL: 94.03
SPY: 198.20
GOOG: 584.73
NFLX: 472.35
MSFT: 41.80
furas
  • That's a neat piece of code but the OP was asking for a way to integrate regex. That cssselect stuff doesn't look like regex, just string concatenation – Alkanshel Nov 30 '14 at 23:03