Webscraping in Python

Question

The following code outputs empty lists; I expect it to print the stock price. Any help will be appreciated. Thanks!

import urllib.request
import re
companyList = ["aapl","goog","nflx"]
for i in range(len(companyList)):

    url = "https://finance.yahoo.com/quote/"+companyList[i]+"?p="+companyList[i]
    htmlfile = urllib.request.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="35"><!-- react-text: 36 -->()(.+?)<!-- /react-text --></span>'
    pattern = re.compile(regex)
    price = re.findall(pattern, str(htmltext))
    print(price)

I don't see the point of multiple down-votes, especially for newcomers to SO. Perhaps you could tell us what you're trying to extract from that page. Meanwhile, I would suggest that you would be better off using BeautifulSoup, or one of the other means of dealing with webpages, than regex. That approach is fraught with difficulties. — Bill Bell, Sep 06 '17 at 18:15
If you want to send a comment to me, type the '@' sign to get a menu and select my name from the list. — Bill Bell, Sep 06 '17 at 18:18
Please don't parse HTML with regex. You can see this famous (or infamous?) question and answer for details: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — cddt, Sep 07 '17 at 04:22
@BillBell Thanks a lot for the suggestion. I'm trying out Webscraping for the first time, and the tutorial I am following used regex. I'll definitely shift to one on BeautifulSoup. :) — Junaid Khalid, Sep 07 '17 at 09:16
You're welcome. Now, what are you trying to get from that page? — Bill Bell, Sep 07 '17 at 14:23
@BillBell The stock prices of the three companies that are in the 'companyList'. — Junaid Khalid, Sep 07 '17 at 14:25
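
For readers following the comments above: here is a minimal sketch of the parser-based approach they recommend. It swaps in the standard library's html.parser for BeautifulSoup so that it runs without a third-party install. The class name SpanTextParser and the sample markup are assumptions for illustration. The markup is modelled on the span quoted in the question, not fetched from the live page.

```python
from html.parser import HTMLParser


class SpanTextParser(HTMLParser):
    """Collect the text of every <span> whose class attribute equals target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # > 0 while we are inside a matching span
        self.texts = []     # one entry per matching span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth:
            # A span nested inside a matching span: track it so the
            # matching span's own end tag is recognised correctly.
            self.depth += 1
        elif dict(attrs).get("class") == self.target_class:
            self.depth = 1
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "span" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # HTML comments such as <!-- react-text: 36 --> go to handle_comment,
        # so only the visible text arrives here.
        if self.depth:
            self.texts[-1] += data


# Hypothetical sample markup, modelled on the span from the question.
PRICE_CLASS = "Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)"
sample_html = (
    '<div><span class="' + PRICE_CLASS + '" data-reactid="35">'
    "<!-- react-text: 36 -->161.38<!-- /react-text --></span></div>"
)

parser = SpanTextParser(PRICE_CLASS)
parser.feed(sample_html)
parser.close()
print(parser.texts)
```

Because the parser matches the class attribute as a plain string, none of the parentheses or dots need escaping.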

score 0 · Accepted Answer · answered Sep 07 '17 at 15:13

I'll do it for one of the companies. But I want your firm promise that you won't tell anyone that I've shown you how to do it.

Get a copy of the HTML for the page and save it locally.

>>> import urllib.request
>>> import re
>>> url = 'https://finance.yahoo.com/quote/AAPL/?p=AAPL'
>>> htmlfile = urllib.request.urlopen(url)
>>> htmltext = htmlfile.read()
>>> open('temp.htm', 'w').write(str(htmltext))
533900
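
A side note on the str(htmltext) call above: urlopen(...).read() returns bytes, and str() on bytes yields the printable representation (with a leading b' and escaped newlines), not the decoded text. The sketch below shows the difference on a small stand-in value. The sample bytes and the UTF-8 encoding are assumptions, since the real page may declare a different encoding.

```python
# Stand-in for the bytes returned by htmlfile.read(); hypothetical sample data.
raw = b'<span>161.38</span>\n'

as_repr = str(raw)             # representation of the bytes object
as_text = raw.decode("utf-8")  # decoded text, assuming the page is UTF-8

print(as_repr)
print(as_text)
```

The regex in this answer still works on the str() form because the target text is plain ASCII, but the decoded form is easier to read when examining the saved file.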

Examine the page, and copy-paste the item you want to be able to identify in this and similar pages. Put it in a comment for reference.

>>> # <span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="35"><!-- react-text: 36 -->161.38<

Save it in a variable, say, exp.

>>> exp = '<span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="35"><!-- react-text: 36 -->161.38<'

Verify that the string contains no runs of two or more blank characters. If it does, replace each run of blanks with \s+

>>> exp.find('  ')
-1
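
If the check above had found runs of blanks, the replacement could be sketched roughly as follows. The helper name flexible_whitespace_pattern and the sample strings are hypothetical. The helper also escapes the non-blank pieces with re.escape, so the result is usable directly as a pattern.

```python
import re


def flexible_whitespace_pattern(text):
    """Escape text for use as a regex, letting each run of blanks match any whitespace run."""
    parts = re.split(r"\s+", text)
    return r"\s+".join(re.escape(part) for part in parts)


# Hypothetical snippet containing a double blank.
pattern = flexible_whitespace_pattern('<span  class="a(b)">')

# The pattern now matches the same markup with different spacing.
print(re.fullmatch(pattern, '<span class="a(b)">') is not None)
print(re.fullmatch(pattern, '<span\n\tclass="a(b)">') is not None)
```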

Prefix each character in the string that is significant to regex with a single '\' character.

>>> re.sub(r'[().]', lambda m: '\\'+m.group(), exp)
'<span class="Trsdu\\(0\\.3s\\) Fw\\(b\\) Fz\\(36px\\) Mb\\(-4px\\) D\\(ib\\)" data-reactid="35"><!-- react-text: 36 -->161\\.38<'
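
As an alternative to escaping by hand, the standard library's re.escape escapes every character that could be special in a pattern. The sketch below compares the two on the same exp string. The exact text re.escape produces varies between Python versions, so it checks only that both patterns match the original literal.

```python
import re

exp = '<span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="35"><!-- react-text: 36 -->161.38<'

# Manual escaping, as in the step above: only ( ) and . are prefixed.
manual = re.sub(r'[().]', lambda m: '\\' + m.group(), exp)

# re.escape prefixes every character that could be special in a pattern.
automatic = re.escape(exp)

# Both patterns match the original literal text exactly.
print(re.fullmatch(manual, exp) is not None)
print(re.fullmatch(automatic, exp) is not None)
```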

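Display the result and examine it. Then copy it into a variable, replacing the literal price at the end (161\\.38) with a capture group, ([^<]+), which matches everything up to the next '<'.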

>>> regex = '<span class="Trsdu\\(0\\.3s\\) Fw\\(b\\) Fz\\(36px\\) Mb\\(-4px\\) D\\(ib\\)" data-reactid="35"><!-- react-text: 36 -->([^<]+)<'

Use the regex to look for the target item.

>>> re.findall(regex, str(htmltext))
['161.38']
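
The steps above can be folded into one small function. This is only a sketch: the function name extract_after and the sample HTML are hypothetical, and it uses re.escape in place of the manual escaping. It runs against a local string instead of the live page, whose markup may differ from what is shown here.

```python
import re

# The literal text that precedes the price, copied from the page source.
PREFIX = ('<span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" '
          'data-reactid="35"><!-- react-text: 36 -->')


def extract_after(prefix, html):
    """Return every run of non-'<' characters that directly follows prefix in html."""
    pattern = re.compile(re.escape(prefix) + r"([^<]+)<")
    return pattern.findall(html)


# Hypothetical stand-in for the downloaded page.
sample = "<div>" + PREFIX + "161.38<!-- /react-text --></span></div>"
print(extract_after(PREFIX, sample))
```

An empty list from this function means the prefix no longer appears verbatim in the page, which is the usual sign that the markup has changed.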

You have my word. Thanks a lot for the help. :D — Junaid Khalid, Sep 07 '17 at 15:22

score 0 · Answer 2 · answered Sep 09 '17 at 18:50

See if the script below helps. It also covers authentication.

    https://github.com/PraveenKandregula/JenkinsRSSScrappingWithPython/blob/master/JenkinsRSSScrappingWithPython.py

Praveen Raj Kumar
