I am using the 're' module in Python to scrape a web page for four stock symbols. Each of the 4 iterations returns an empty list, [], but the output should be the stock price of the requested symbol. The regex string itself is built correctly, as the printed value shows. The source code is included below.

import urllib
import re

symbolslist = ["appl", "spy", "goog", "nflx"]

i = 0
while i < len(symbolslist):
    url = "http://in.finance.yahoo.com/q?s=" + symbolslist[i] + "&ql=1"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span id="yfs_l84_' + symbolslist[i] + '">(.+?)</span>'
    pattern = re.compile(regex)
    print regex
    price = re.findall(pattern, htmltext)
    print "price of ", symbolslist[i], "is", price
    i += 1

There is no syntax or indentation error, and the output looks like this:

<span id="yfs_l84_appl">(.+?)</span>
price of  appl is []
<span id="yfs_l84_spy">(.+?)</span>
price of  spy is []
<span id="yfs_l84_goog">(.+?)</span>
price of  goog is []
<span id="yfs_l84_nflx">(.+?)</span>
price of  nflx is []

The list is empty every time; the stock price is never captured.

The page being crawled is https://in.finance.yahoo.com/q?s=NFLX&ql=0

Martin Evans
SaiKiran
  • Use an HTML parser like BeautifulSoup to parse HTML! – Daniel Feb 14 '16 at 13:02
  • Print the htmltext variable before running the regex on it. I would not be surprised if the actual stock price is updated using AJAX and is not in the HTML you receive. – jpou Feb 14 '16 at 13:02
  • @jpou The same program works for a single stock symbol and fails when we run it over an array of symbols. – SaiKiran Feb 14 '16 at 13:04
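
The check suggested in the comments can be sketched as follows. This is only a minimal illustration: the helper name span_ids_with_prefix and the sample string are hypothetical, and the sample stands in for the downloaded htmltext. It lists every id that starts with "yfs_", so it is easy to see whether the id the regex expects is present at all.

```python
import re

def span_ids_with_prefix(htmltext, prefix="yfs_"):
    """Return every id attribute value in htmltext that starts with prefix."""
    pattern = re.compile(r'id="(' + re.escape(prefix) + r'[^"]*)"')
    return pattern.findall(htmltext)

# Hypothetical sample standing in for a downloaded page (not real Yahoo output).
sample = ('<div><span id="yfs_l84_nflx">87.40</span>'
          '<span id="yfs_c63_nflx">-1.20</span></div>')

ids = span_ids_with_prefix(sample)
print(ids)                    # the ids that are actually in the markup
print("yfs_l84_nflx" in ids)  # is the id the regex expects among them?
```

If the expected id is missing from the list, the problem lies in the page that was fetched rather than in the regex.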

1 Answer

As an alternative approach, you might find it easier to use the yahoo_finance library as follows:

from yahoo_finance import Share

for symbol in ["appl", "spy", "goog", "nflx"]:
    yahoo = Share(symbol)
    print 'Price of {} is {}'.format(symbol, yahoo.get_price())

Giving you the following output:

Price of appl is 96.11
Price of spy is 186.63
Price of goog is 682.40
Price of nflx is 87.40

It is never a wise move to try and parse HTML data using regular expressions.
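
To illustrate why, here is a small sketch comparing the strict regex from the question with a parser-based lookup. It assumes Python 3 and uses only the standard library's html.parser. The class SpanTextById, the helper span_text and both sample strings are hypothetical; the "reordered" sample simply adds an attribute before the id, which is enough to defeat the regex.

```python
import re
from html.parser import HTMLParser  # Python 3; the Python 2 module is named HTMLParser

class SpanTextById(HTMLParser):
    """Collect the text of the first <span> whose id equals target_id.

    A sketch only: nested spans inside the target are not handled.
    """
    def __init__(self, target_id):
        HTMLParser.__init__(self)
        self.target_id = target_id
        self.inside = False
        self.text = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so attribute order is irrelevant.
        if (tag == "span" and self.text is None
                and dict(attrs).get("id") == self.target_id):
            self.inside = True
            self.text = ""

    def handle_data(self, data):
        if self.inside:
            self.text += data

    def handle_endtag(self, tag):
        if tag == "span":
            self.inside = False

def span_text(html, target_id):
    """Return the text of the span with the given id, or None if it is absent."""
    parser = SpanTextById(target_id)
    parser.feed(html)
    parser.close()
    return parser.text

strict = re.compile(r'<span id="yfs_l84_nflx">(.+?)</span>')
plain = '<span id="yfs_l84_nflx">87.40</span>'
reordered = '<span class="time_rtq_ticker" id="yfs_l84_nflx">87.40</span>'

print(strict.findall(plain))                 # ['87.40']
print(strict.findall(reordered))             # []
print(span_text(reordered, "yfs_l84_nflx"))  # 87.40
```

The regex depends on the exact character sequence of the tag, while the parser only depends on the tag name and the id value.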


Another approach would be to extract the information first using BeautifulSoup:

from bs4 import BeautifulSoup
import requests
import re

for symbol in ["appl", "spy", "goog", "nflx"]:
    url = 'http://finance.yahoo.com/q?s={}'.format(symbol)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")

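    # Match ids such as yfs_l84_nflx; find() returns None when no such span exists
    data = soup.find('span', attrs={'id': re.compile(r'yfs_.*?_{}'.format(symbol.lower()))})
    if data is None:
        print 'No price found for {}'.format(symbol)
    else:
        print 'Price of {} is {}'.format(symbol, data.text)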
Martin Evans
  • But my approach is to use only the regex module for now – SaiKiran Feb 14 '16 at 12:57
  • @SaiKiranUppu: [***NEVER USE REGEX PARSE HTML***](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Remi Guan Feb 14 '16 at 13:15
  • Click on the link, it will take you to the website where you can download and install the library. – Martin Evans Feb 14 '16 at 13:27
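
For readers who, like the asker, want to stay with the re module for now, a somewhat more tolerant pattern might look like the sketch below. It is an assumption-laden illustration, not a fix for the original problem: the helper extract_price and the sample pages are hypothetical, the id scheme yfs_l84_<symbol> is taken from the question, and the pattern still breaks on any markup it does not anticipate. It returns None instead of an empty list when nothing matches, which makes the "id not in the page" case explicit.

```python
import re

def extract_price(htmltext, symbol):
    """Return the text of the first span whose id is yfs_l84_<symbol>, or None."""
    pattern = re.compile(
        # Allow other attributes before and after the id; escape the symbol
        # so characters such as '.' or '^' are matched literally.
        r'<span\b[^>]*\bid="yfs_l84_' + re.escape(symbol) + r'"[^>]*>(.+?)</span>',
        re.IGNORECASE | re.DOTALL,
    )
    match = pattern.search(htmltext)
    return match.group(1).strip() if match else None

# Hypothetical pages standing in for downloaded HTML (not real Yahoo output).
pages = {
    "nflx": '<span class="x" id="yfs_l84_nflx"> 87.40 </span>',
    "goog": '<p>nothing here</p>',
}

for symbol in ["nflx", "goog"]:
    print("price of", symbol, "is", extract_price(pages[symbol], symbol))
```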