Find HTML between two variables using Python regex

Question

I'm new to Python.

I'm trying to scrape some information from a webpage. The first thing I'd like to get is all the HTML between today's and yesterday's dates. Here is what I have so far:

import datetime
import urllib
import re

t = datetime.date.today()
t1 = t.strftime("%B %d, %Y")
y = datetime.date.today() - datetime.timedelta(1)
y1 = y.strftime("%B %d, %Y")

htmlfile = urllib.urlopen("http://www.blu-ray.com/itunes/movies.php?show=newreleases")
htmltext = htmlfile.read()

block1 = re.search(t1 + r'(.*)' + re.escape(y1), htmltext)
print block1

From what I can tell (and I'm probably wrong), my regex should grab what I want it to, so that I can then start pulling out info from today's date only. But it returns 'None'.

I'm sure that it's just my limited understanding as I am new to this but any help would be greatly appreciated. Thanks a lot!

The problem is that `.*` doesn't match line breaks. But you really should use a HTML parser, like alecxe said. — Aran-Fey, Dec 17 '14 at 20:38
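To illustrate the comment above, here is a minimal sketch of how `.*` behaves with and without `re.DOTALL`. The HTML string is a made-up stand-in for the downloaded page (the real markup may differ), and `between` is a hypothetical helper name, not something from the question. Both date strings are passed through `re.escape`, and the group is non-greedy so it stops at the first occurrence of the end marker.

```python
import re

# Hypothetical stand-in for the downloaded page; the real markup may differ.
html = (
    "<h3>December 17, 2014</h3>\n"
    "<table><tr><td>App (2013)</td></tr></table>\n"
    "<h3>December 16, 2014</h3>\n"
)
today = "December 17, 2014"
yesterday = "December 16, 2014"


def between(start, end, text, flags=0):
    """Return the text between two literal markers, or None if not found."""
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, text, flags)
    return match.group(1) if match else None


# Without re.DOTALL, '.' stops at the newline, so nothing matches.
print(between(today, yesterday, html))

# With re.DOTALL, '.' also matches newlines and the span is found.
print(between(today, yesterday, html, re.DOTALL))
```

This only shows why the original search returned None; it does not make regex a robust way to pick apart HTML.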

score 2 · Accepted Answer · edited May 23 '17 at 12:20

Don't use regular expressions to parse HTML; use an HTML parser such as BeautifulSoup.

It takes a bit more code, but the idea is to iterate over all h3 elements whose text is a date in the specified format (%B %d, %Y), then collect the table tags that follow each one until we hit another h3 tag or run out of siblings:

from datetime import datetime
import urllib
from bs4 import BeautifulSoup

data = urllib.urlopen("http://www.blu-ray.com/itunes/movies.php?show=newreleases")
soup = BeautifulSoup(data)

def is_date(d):
    try:
        datetime.strptime(d, '%B %d, %Y')
        return True
    except (ValueError, TypeError):
        return False

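# Find every h3 whose text parses as a date such as "December 17, 2014".
for date in soup.find_all('h3', text=is_date):
    print date.text

    # Walk the following h3/table siblings; stop at the next heading.
    for element in date.find_next_siblings(['h3', 'table']):
        if element.name == 'h3':
            break

        # The first link inside each table holds the title in its title attribute.
        print element.a.get('title')
    print "----"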

Prints:

December 17, 2014
App (2013)
----
December 16, 2014
The Equalizer (2014)
Annabelle (2014)
A Walk Among the Tombstones (2014)
The Guest (2014)
Men, Women & Children (2014)
At the Devil's Door (2014)
The Canal (2014)
The Bitter Tears of Petra von Kant (1972)
Avatar (2009)
Atlas Shrugged Part III: Who Is John Galt? (2014)
Expelled (2014)
Level Five (1997)
The Device (2014)
Two-Bit Waltz (2014)
The Devil's Hand (2014)
----
December 15, 2014
Star Trek: The Next Generation, Season 6 (1992-1993)
Ristorante Paradiso, Season 1 (2009)
A Certain Magical Index II, Season 2, Pt. 2 (2011)
Cowboy Bebop, The Complete Series (1998-1999)

Feel free to ask additional questions about the posted code; I'd be glad to explain.

score 0 · Answer 2 · edited May 23 '17 at 12:12

Your code was throwing an error on t.strftime("%B %d, %Y").

The correct format for the line is t1 = strftime("%B %d, %Y", t)

I was also getting: TypeError: argument must be 9-item sequence, not datetime.datetime

From this error, you can search for many solutions. I don't know which version of Python you're using, but the solutions use the entire time, not just the date. So you probably need to get the time and subtract a day.

See here: Extract time from datetime and determine if time (not date) falls within range?

And here: How can I generate POSIX values for yesterday and today at midnight in Python?
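For reference, a minimal sketch of the two strftime spellings discussed here, assuming the bare `strftime(...)` call refers to the `time` module's function. `datetime.date` objects have their own `strftime` method, while `time.strftime` expects a time tuple (a `struct_time`), which is what the "9-item sequence" in the TypeError refers to. The fixed date is an assumption chosen to keep the example reproducible (the question uses `date.today()`), and the month name depends on the active locale.

```python
import datetime
import time

# A fixed date keeps the example reproducible; the question uses date.today().
today = datetime.date(2014, 12, 17)
yesterday = today - datetime.timedelta(days=1)

fmt = "%B %d, %Y"

# date objects carry their own strftime method.
t1 = today.strftime(fmt)
y1 = yesterday.strftime(fmt)

# time.strftime wants a struct_time, so convert the date with timetuple().
t1_via_time = time.strftime(fmt, today.timetuple())

print(t1)
print(y1)
print(t1_via_time)
```

Subtracting a `timedelta` from a `date` already gives yesterday's date, so no time-of-day handling is needed for this particular format.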
