How to display proper output when using re.findall() in Python?

Question

I wrote a Python script to get the latest stock price from Yahoo Finance.

import urllib.request
import re

htmlfile = urllib.request.urlopen("http://finance.yahoo.com/q?s=GOOG")

htmltext = htmlfile.read()

price = re.findall(b'<span id="yfs_l84_goog">(.+?)</span>', htmltext)
print(price)

It works smoothly, but when I print the price it comes out like this:
[b'1,217.04']

This might be a petty issue, but I'm new to Python scripting, so please help me if you can.

I want to get rid of the 'b'. If I remove the 'b' prefix from the pattern (b'<span id="yfs_l84_goog">(.+?)</span>'), I get this error:

  File "C:\Python33\lib\re.py", line 201, in findall
    return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object

I want the output to be just:

1,217.04

score 3 · Accepted Answer · edited May 23 '17 at 12:03

b'' is the syntax for bytes literals in Python. It is how you define byte sequences in Python source code.

What you see in the output is the representation of the single bytes object inside the price list returned by re.findall(). You could decode it into a string and print it:

>>> for item in price:
...     print(item.decode()) # assume utf-8
... 
1,217.04
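
To try this without hitting the network, here is a minimal sketch that runs the question's pattern against a hard-coded bytes snippet and decodes each match. The `sample_html` value is made up for illustration, and UTF-8 is an assumption about the page encoding.

```python
import re

# Hypothetical stand-in for the downloaded page; urlopen().read() returns bytes too.
sample_html = b'<div><span id="yfs_l84_goog">1,217.04</span></div>'

# The input is bytes, so the pattern must be a bytes pattern as well.
matches = re.findall(b'<span id="yfs_l84_goog">(.+?)</span>', sample_html)

# Decode each match to str (assuming UTF-8) so it prints without the b'' wrapper.
prices = [m.decode('utf-8') for m in matches]
for p in prices:
    print(p)  # prints 1,217.04
```

Printing the list itself shows the repr of each element, which is why the b'' prefix appears in the question's output; printing the decoded items one by one avoids it.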

You could also write bytes directly to stdout, e.g., sys.stdout.buffer.write(price[0]).
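
A small sketch of writing raw bytes to stdout, under the assumption that stdout is a regular text stream with an underlying binary buffer. The helper name `write_raw` and its optional `stream` parameter are hypothetical; they exist only so the behaviour can be checked against an in-memory stream.

```python
import io
import sys

def write_raw(data, stream=None):
    """Write bytes plus a newline to a binary stream (default: stdout's buffer)."""
    if stream is None:
        stream = sys.stdout.buffer
    stream.write(data + b'\n')
    # Flush so the bytes are not reordered relative to later print() calls.
    stream.flush()

if __name__ == '__main__':
    write_raw(b'1,217.04')
```

No decoding happens here, so the bytes reach the terminal exactly as they were downloaded; whether they display correctly then depends on the terminal's encoding.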

You could use an HTML parser instead of a regex to parse HTML:

#!/usr/bin/env python3
from html.parser import HTMLParser
from urllib.request import urlopen

url = 'http://finance.yahoo.com/q?s=GOOG'

def is_price_tag(tag, attrs):
    """Return True for the <span> element that holds the price."""
    return tag == 'span' and dict(attrs).get('id') == 'yfs_l84_goog'

class Parser(HTMLParser):
    """Extract tag's text content from html."""
    def __init__(self, html, starttag_callback):
        HTMLParser.__init__(self)
        self.contents = []
        self.intag = False
        self.starttag_callback = starttag_callback
        self.feed(html)

    def handle_starttag(self, tag, attrs):
        # true only while directly inside a tag accepted by the callback
        self.intag = self.starttag_callback(tag, attrs)

    def handle_endtag(self, tag):
        self.intag = False

    def handle_data(self, data):
        if self.intag:
            self.contents.append(data)

# download and convert to Unicode
response = urlopen(url)
# read the charset from the Content-Type header;
# falling back to UTF-8 when none is declared is an assumption
charset = response.headers.get_content_charset('utf-8')
html = response.read().decode(charset)

# parse html (extract text from the price tag)
content = Parser(html, is_price_tag).contents[0]
print(content)
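
The decoding step above depends on reading the character encoding from the Content-Type header. The lookup can be sketched offline with the standard library's `email.message.Message`, which is the base class of the headers object that `urlopen()` responses expose. The helper name `charset_from_content_type` is hypothetical, and defaulting to UTF-8 is an assumption.

```python
from email.message import Message

def charset_from_content_type(content_type, default='utf-8'):
    """Return the charset parameter of a Content-Type value, or `default`."""
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() lowercases the value and returns `default`
    # when the header carries no charset parameter.
    return msg.get_content_charset(default)

print(charset_from_content_type('text/html; charset=ISO-8859-1'))  # iso-8859-1
print(charset_from_content_type('text/html'))                      # utf-8
```

A page may also declare its encoding in a `<meta>` tag or not at all, so the header is a reasonable first guess rather than a guarantee.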

Check whether Yahoo provides an API that doesn't require web scraping.

It worked great using item.decode(), does it convert bytes to string? — Saurabh Rana, Feb 25 '14 at 16:46
@SaurabhRana: yes. `some_bytes.decode(character_encoding) == some_unicode_text` — jfs, Feb 25 '14 at 16:48
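
The relationship in the comment above can be checked offline. This is a minimal sketch with made-up sample text: it shows that encoding and decoding with the same codec round-trips, while decoding with a different codec can succeed yet silently produce different text.

```python
# Sample text is made up for illustration; \u20ac is the euro sign.
text = 'GOOG: 1,217.04 \u20ac'

# Encoding and then decoding with the same codec returns the original text.
round_trips = {
    enc: text.encode(enc).decode(enc)
    for enc in ('utf-8', 'cp1252', 'utf-16')
}

# Decoding UTF-8 bytes as cp1252 raises no error for this input,
# but the result is garbled and no longer equals the original.
garbled = text.encode('utf-8').decode('cp1252')

print(all(value == text for value in round_trips.values()))  # True
print(garbled == text)                                       # False
```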

score 0 · Answer 2 · answered Feb 25 '14 at 15:52

Okay, after searching for a while, I found a solution. It works fine for me.

import urllib.request
import re

htmlfile = urllib.request.urlopen("http://finance.yahoo.com/q?s=GOOG")

htmltext = htmlfile.read()

pattern = re.compile('<span id="yfs_l84_goog">(.+?)</span>')

price = pattern.findall(str(htmltext))
print(price)

Saurabh Rana

don't call `str(some_bytes_object_that_might_be_text)` unconditionally. If the page is not encoded in `sys.getdefaultencoding()` then `str()` may produce a wrong result (sometimes silently). You could [use `Content-Type` header to get character encoding](http://stackoverflow.com/a/22020377/4279) – jfs Feb 25 '14 at 16:36
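
A short sketch of why `str()` on bytes is risky, using made-up sample bytes. In Python 3, calling `str()` on a bytes object without an encoding argument returns its repr rather than decoded text, so a regex may still match ASCII content while non-ASCII characters come back as escape sequences.

```python
# Made-up bytes standing in for a downloaded page fragment (\u00e9 is e-acute).
data = 'caf\u00e9: 1,217.04'.encode('utf-8')

# Without an encoding argument, str() returns the repr of the bytes object,
# including the b'' wrapper and escaped non-ASCII bytes.
as_repr = str(data)

# Decoding with the right codec recovers the actual text.
as_text = data.decode('utf-8')

# Passing the encoding to str() is equivalent to calling decode().
also_text = str(data, 'utf-8')

print(as_repr)                 # b'caf\xc3\xa9: 1,217.04'
print(as_text == also_text)    # True
```

Decoding with the charset reported by the server, and falling back to a sensible default only when none is given, keeps the extracted text faithful to the page.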
