
I have a project that I am working on at home that uses the Rotten Tomatoes API to gather the movies currently in theaters. It then counts all the images on each of those movies' IMDb pages. The part I am having trouble with is the gathering of the images. The goal is to get this code to run in under 8 seconds, but the regex I am running is taking forever! Currently I am using this regular expression:

re.findall('<img.*?>', str(line))

where line is a chunk of HTML

Does anyone have a better (perhaps more refined?) regex that they can think of? All comments welcome!

Full code below.

import json, re, pprint, time
from urllib2 import urlopen

def get_image(url):
    # Count <img> tags line by line in the fetched page
    total = 0
    page  = urlopen(url).readlines()

    for line in page:
        hit    = re.findall('<img.*?>', str(line))
        total += len(hit)
    # print('{0} Images total: {1}'.format(url, total))
    return total


if __name__ == "__main__":
    start = time.time()
    json_list = list()
    url = "http://api.rottentomatoes.com/api/public/v1.0/lists/movies/in_theaters.json?apikey=<apikey>"
    response = urlopen(url)
    data = json.loads(response.read())

    for i in data["movies"]:
        json_dict = dict()
        json_dict["Title"] = str(i['title'])
        json_dict["url"] = str("http://www.imdb.com/title/tt" + i['alternate_ids']['imdb'])
        json_dict["imdb_id"] = str(i['alternate_ids']['imdb'])
        json_dict["count"] = get_image(json_dict["url"])
        json_list.append(json_dict)
    end = time.time()
    pprint.pprint(json_list)
    runtime = end - start
    print "Program runtime: " + str(runtime)
Alex Daro

3 Answers


You can't parse HTML with regular expressions. If you can only use standard libraries for Python 2, use HTMLParser:

from HTMLParser import HTMLParser
class ImgFinder(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            # .get avoids a KeyError on <img> tags without a src attribute
            print 'found img tag, src=', dict(attrs).get('src')

parser = ImgFinder()
parser.feed(... HTML source ...)
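To adapt this to the question's counting task, a minimal sketch (the HTML snippet is invented, and the try/except import also lets it run on Python 3):

```python
try:
    from HTMLParser import HTMLParser  # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3

class ImgCollector(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)  # old-style class on Python 2, so no super()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; .get avoids a
        # KeyError on <img> tags that have no src attribute
        if tag == 'img':
            self.srcs.append(dict(attrs).get('src'))

parser = ImgCollector()
parser.feed('<p>text <img src="a.jpg"> more <img src="b.png" alt=""></p>')
print(parser.srcs)  # ['a.jpg', 'b.png']
count = len(parser.srcs)  # 2 -- the per-page total the question is after
```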
vgel
  • Tried this already.... This does not find nested img tags.... and what about img tags generated by javascript? No go. – Alex Daro Sep 19 '14 at 18:58
  • @DirkDigler *no* parsing schema will be able to parse img tags generated by javascript - because they are simply not in the page source. They only exist in the DOM once a javascript engine runs whatever scripts need to be run. – roippi Sep 19 '14 at 19:40
  • @roippi Javascript CAN be in page source.... Run this code and tell me if you see Javascript in stdout: page= urlopen("http://www.imdb.com/title/tt2978462").readlines() for line in page: if line: print line – Alex Daro Sep 19 '14 at 20:23
  • How would someone handle JS in HTML source that looks like this: onclick="(new Image()).src='/rg/help/footer/images/b.gif?link=/help/';" – Alex Daro Sep 19 '14 at 20:30
  • Of course the raw javascript … – roippi Sep 19 '14 at 20:34
  • @roippi, that makes sense. But we still haven't answered the question of nested IMG tags. Check out this fiddle: http://pythonfiddle.com/html-parser-test/ Then view the page source: http://www.imdb.com/title/tt1065073/ and do a simple find for … – Alex Daro Sep 19 '14 at 21:23

While you certainly should listen to the general wisdom that it's a bad idea to use regex to parse HTML (you really should use an HTML parser), there is a point to be made about the efficiency of your regex.

Compare these two:

>>> timeit('import re; re.findall("<img.*?>", \'blah blah blah <img src="http://www.example.org/test.jpg"> blah blah blah <img src="http://wwww.example.org/test2.jpg"> blah blah blah\')')
3.366645097732544
>>> timeit('import re; re.findall("<img[^>]*>", \'blah blah blah <img src="http://www.example.org/test.jpg"> blah blah blah <img src="http://wwww.example.org/test2.jpg"> blah blah blah\')')
2.328295946121216

You can see that the latter regex, which matches the same thing, is noticeably faster. That's because it doesn't require backtracking: `[^>]*` fails immediately at `>` instead of lazily consuming one character at a time. See this blog post for an explanation of why that is: http://blog.stevenlevithan.com/archives/greedy-lazy-performance
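A related micro-optimization (not from this answer, just a common pattern): since the question applies the regex once per line, compiling the pattern once outside the loop avoids repeated lookups in `re`'s pattern cache. A small sketch with made-up HTML:

```python
import re

IMG_RE = re.compile(r'<img[^>]*>')  # compiled once, reused for every line

html_lines = [
    'blah <img src="a.jpg"> blah',
    'no images on this line',
    '<img src="b.png"><img src="c.gif">',
]
total = sum(len(IMG_RE.findall(line)) for line in html_lines)
print(total)  # 3
```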

FatalError
  • I think `.*?>` is a repetitive two-step process: checking if the character is `>`, and if not, consuming 1 character (repeating over and over). `[^>]*` immediately finds the next `>`, then consumes up to that match position. But it can be made to backtrack from `>` if the next character is not `>`. This is useful if you need to search backwards from a starting point where things overlap and no real hard anchors can be used. –  Sep 19 '14 at 18:32
  • Thank you FatalError: `<img[^>]*>` shaved a few seconds off of the runtime. And yes, I agree with you that regex is not the right way to go, but I want this to run out of the box with Python 2.x, so tools like BeautifulSoup or Mechanize are out of the question.... – Alex Daro Sep 19 '14 at 19:00

Although I know using regex to search for img tags in HTML is not ideal, here is the approach I ended up going with. By threading, I was able to get the runtime down to anywhere from 2 to 12 seconds, depending on your connection:

#No shebang line, please run in Linux shell % python img_count.py

#Python libs
import threading, urllib2, re
import Queue, json, time, pprint

#Global lists 
JSON_LIST = list()
URLS = list()

def get_movies():
    url = "http://api.rottentomatoes.com/api/public/v1.0/lists/movies/in_theaters.json?apikey=<apikey>"
    response = urllib2.urlopen(url)
    data = json.loads(response.read())    
    return data


def get_imgs(html):
    total = 0
    # This next line is not ideal. Would much rather use a lib such as Beautiful Soup for this
    total += len(re.findall(r"<img[^>]*>", html)) 
    return total


def read_url(url, queue):
    # Tag each result with its URL so it can be matched back to the
    # right movie later (threads finish in an arbitrary order).
    data = urllib2.urlopen(url).read()
    queue.put((url, data))


def fetch_urls():
    result = Queue.Queue()
    threads = [threading.Thread(target=read_url, args=(url, result)) for url in URLS]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return result


if __name__ == "__main__":
    start = time.time()
    movies = get_movies()
    for movie in movies["movies"]:
        url = "http://www.imdb.com/title/tt" + movie['alternate_ids']['imdb']
        URLS.append(url)
    queue = fetch_urls()
    # Drain the queue into a url -> html map. Queue order is thread
    # completion order, not URLS order, so results must be matched by URL
    # or counts can end up attached to the wrong movie.
    pages = dict()
    while not queue.empty():
        url, html = queue.get()
        pages[url] = html
    for movie in movies["movies"]:
        url = "http://www.imdb.com/title/tt" + movie['alternate_ids']['imdb']
        json_dict = {
                "title": movie['title'],
                "url": url,
                "imdb_id": movie['alternate_ids']['imdb'],
                "count": get_imgs(pages[url])
                }
        JSON_LIST.append(json_dict)
    pprint.pprint(JSON_LIST)
    end = time.time()
    print "\n"
    print "Elapsed Time (seconds):", end - start
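As the comment in `get_imgs` concedes, the regex line is still the weak spot. A stdlib-only, parser-based drop-in for it could look like the sketch below (the try/except import keeps it working on Python 2, as in the code above, and on Python 3):

```python
try:
    from HTMLParser import HTMLParser  # Python 2, as in the code above
except ImportError:
    from html.parser import HTMLParser  # Python 3

class ImgTagCounter(HTMLParser):
    """Counts <img> start tags instead of regex-matching them."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.total = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            self.total += 1

def get_imgs_parsed(html):
    # Same shape as get_imgs above: HTML string in, count out.
    counter = ImgTagCounter()
    counter.feed(html)
    return counter.total

print(get_imgs_parsed('<div><img src="x.jpg"/><img src="y.jpg"></div>'))  # 2
```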
Alex Daro