I'm struggling to match expressions with the re module while crawling websites. I crawled several websites using Python and noticed that re.findall seemed to return values only for some patterns (for example, spans identified by a class attribute). Is there any way to extract the text inside an element like the one below (a stock price from CNN Money)? When I tried, I only got an empty list.

<span stream="last_36276" streamformat="ToHundredth" streamfeed="SunGard">109.95</span>

Here's my code for scraping the Apple stock price from CNN Money, using Python 3.5.1.
Any help is really appreciated:

import urllib.request
import re


with urllib.request.urlopen("http://money.cnn.com/quote/quote.html?symb=AAPL") as url:
    s = url.read()

pattern = re.compile(b'<span stream="last_205778" streamformat="ToHundredth" streamfeed="SunGard">(.+?)</span>')

price=re.findall(pattern,s)

print(price)

# Searching for the first two expressions works, but the last one returns an empty list

#<span title="2010-10-19 14:59:01Z" class="relativetime">Oct 19 10 at 14:59</span>

#<span itemprop="upvoteCount" class="vote-count-post ">45</span>

#<span stream="last_205778" streamformat="ToHundredth" streamfeed="SunGard">60.64</span>
I Like
  • 1
    You should consider using a parser such as `BeautifulSoup` instead of the regex approach... – l'L'l Nov 17 '16 at 23:25
  • 1
    I feel [this answer](http://stackoverflow.com/a/1732454/707650) is the mandatory duplicate. –  Nov 17 '16 at 23:25
  • 1
    If the result is an empty array, it *wasn't* a matching expression. And +1 for parsing HTML with HTML parsers, not regex. – jonrsharpe Nov 17 '16 at 23:28
  • 1
    What do you mean by "the first two expressions"? Only the last line matches the regular expression. The others don't have `stream="last_205778"` – Barmar Nov 17 '16 at 23:28
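
Following the parser suggestion in the comments, here is a minimal sketch using only the standard library's `html.parser` (BeautifulSoup would be a third-party alternative). The class name `StreamSpanParser` and the `sample` string are hypothetical, made up for illustration; the sample copies the span from the question with the capitalization reported in the answer. `HTMLParser` lowercases tag and attribute names, so the capitalization of `streamFormat` does not matter here. The sketch assumes the spans are not nested.

```python
from html.parser import HTMLParser


class StreamSpanParser(HTMLParser):
    """Collects the text of every <span> that carries a `stream` attribute."""

    def __init__(self):
        super().__init__()
        self._in_span = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names
        if tag == "span" and "stream" in dict(attrs):
            self._in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_span = False

    def handle_data(self, data):
        if self._in_span:
            self.values.append(data.strip())


# Hypothetical stand-in for the downloaded page source
sample = '<span stream="last_36276" streamFormat="ToHundredth" streamFeed="SunGard">109.95</span>'

parser = StreamSpanParser()
parser.feed(sample)
parser.close()
print(parser.values)  # ['109.95']
```

With the real page, the bytes returned by `url.read()` would need to be decoded to `str` before calling `feed`.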

1 Answer

You say you want stream="last_36276", but you are searching for stream="last_205778". The latter is never found on that page, so re.findall() correctly returns an empty list.

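Also, you are searching for streamformat, but the actual page has streamFormat. Ditto streamfeed vs streamFeed. Regular expressions are case-sensitive by default, so the pattern would not match even with the right stream id.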
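
To illustrate both points, here is a minimal sketch that runs against a hard-coded sample instead of the live page. The `html` bytes are an assumption: they copy the span from the question with the capitalization described above. The pattern matches any numeric stream id rather than a specific one, and `re.IGNORECASE` makes the attribute names match regardless of case.

```python
import re

# Hypothetical sample of the page source (not fetched from the site)
html = b'<span stream="last_36276" streamFormat="ToHundredth" streamFeed="SunGard">109.95</span>'

# \d+ accepts any stream id; IGNORECASE lets "streamformat" match "streamFormat"
pattern = re.compile(
    rb'<span stream="last_\d+" streamformat="ToHundredth" streamfeed="SunGard">(.+?)</span>',
    re.IGNORECASE,
)

prices = pattern.findall(html)
print(prices)  # [b'109.95']
```

Because the pattern is a bytes pattern, the matches are bytes as well; `prices[0].decode()` gives the price as a string.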

Robᵩ
  • thank you! how did you know streamformat was supposed to be capitalized? It appears as `streamformat` on the cnn page – I Like Nov 18 '16 at 03:21
  • Using Chrome, I visited the URL from your code. I right-clicked and chose "View Page Source". In the HTML, the word is capitalized. – Robᵩ Nov 18 '16 at 05:59