I have a project I am working on at home that uses the Rotten Tomatoes API to gather the movies currently in theaters. It then gathers all the images on each movie's IMDb page. The part I am having trouble with is gathering the images. The goal is to get this code to run in under 8 seconds, but the regex I am running is taking forever! Currently I am using this regular expression:
re.findall('<img.*?>', str(line))
where line is a chunk of HTML
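To make the behaviour concrete, here is a minimal sketch of what that call returns. The sample chunk is hypothetical and only stands in for one line of a downloaded page; it is not taken from a real IMDb response.

```python
import re

# Hypothetical sample chunk, standing in for one line of a downloaded page.
line = '<p><img src="poster.jpg" alt="Poster"> text <img src="still.jpg"></p>'

# Same pattern as in the question: lazily match from "<img" to the next ">".
hits = re.findall('<img.*?>', str(line))

print(hits)       # ['<img src="poster.jpg" alt="Poster">', '<img src="still.jpg">']
print(len(hits))  # 2
```

Since `.` does not match a newline by default and the page is read line by line, an `<img>` tag that is split across two lines would not be counted by this approach.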
Does anyone have a better (perhaps more refined) regular expression in mind? All comments welcome!
Full code below and attached.
import json, re, pprint, time
from urllib2 import urlopen


def get_image(url):
    # Count the <img ...> tags on the page at url, one line at a time.
    total = 0
    page = urlopen(url).readlines()
    for line in page:
        hit = re.findall('<img.*?>', str(line))
        total += len(hit)
    # print('{0} Images total: {1}'.format(url, total))
    return total


if __name__ == "__main__":
    start = time.time()
    json_list = list()
    url = "http://api.rottentomatoes.com/api/public/v1.0/lists/movies/in_theaters.json?apikey=<apikey>"
    response = urlopen(url)
    data = json.loads(response.read())
    # Build one record per movie, including the image count of its IMDb page.
    for i in data["movies"]:
        json_dict = dict()
        json_dict["Title"] = str(i['title'])
        json_dict["url"] = str("http://www.imdb.com/title/tt" + i['alternate_ids']['imdb'])
        json_dict["imdb_id"] = str(i['alternate_ids']['imdb'])
        json_dict["count"] = get_image(str(json_dict["url"]) )
        json_list.append(json_dict)
    end = time.time()
    pprint.pprint(json_list)
    runtime = end - start
    print "Program runtime: " + str(runtime)
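For comparison, one possible refinement of the pattern is sketched below. It assumes the whole page is available as a single string (for example from `urlopen(url).read()`), and the names `IMG_TAG` and `count_img_tags` are made up for this illustration. Whether it is actually faster on real pages is untested here.

```python
import re

# Compiled once and reused for every page. [^>]* stops at the first ">"
# and, unlike ".", also matches newlines, so a tag split across several
# lines is still counted. \b keeps tags such as <imgfoo> from matching.
IMG_TAG = re.compile(r'<img\b[^>]*>', re.IGNORECASE)


def count_img_tags(html):
    """Return the number of <img ...> tags found in an HTML string."""
    return len(IMG_TAG.findall(html))


# Hypothetical sample: the second tag is upper-case and spans two lines.
sample = '<div><img src="a.png"><IMG\n  src="b.png" alt="b"></div>'
print(count_img_tags(sample))  # 2
```

Each call to `get_image` also downloads a full page, so timing the `urlopen` call and the regex separately would show which of the two dominates the total runtime.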