Newbie Python Regex question: Pulling dates from webpage

Question

I'm looking to use Python to pull a regular string of text from a webpage - the source code runs like this:

<br /><strong>Date: 06/12/2010</strong> <br />

It always begins with

<strong>Date:

and ends with

</strong>

I've already scraped the text of the webpage and just want to pull the date and similarly structured information. Any suggestions how to do this? (Sorry this is such a newbie question!)

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Katriel, Dec 16 '10 at 16:15

nmichaels · Answer 1 · 2010-12-16T16:19:59.607

3

You can use a regular expression:

import re
pattern = re.compile(r'<strong>Date:(?P<date>.*?)</strong>') # add re.DOTALL if a match can span lines
# Then use it with
pattern.findall(text) # returns a list of all matches
# or
match = pattern.search(text) # grabs the first match (None if nothing matches)
match.groupdict() # gives a dictionary with key 'date'
# or
match.groups()[0] # gives you just the captured text (the first group)

Alternatively, try parsing the page with Beautiful Soup.
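For the parser route, here is a minimal sketch. It deliberately uses the standard library's `html.parser` instead of Beautiful Soup (a third-party package), so it runs without installing anything. The class name `StrongTextParser` and the sample `html` string are made up for illustration.

```python
from html.parser import HTMLParser


class StrongTextParser(HTMLParser):
    """Collects the text found inside every <strong> element."""

    def __init__(self):
        super().__init__()
        self.in_strong = False  # True while between <strong> and </strong>
        self.chunks = []        # one entry per <strong> element

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.in_strong = True
            self.chunks.append("")

    def handle_endtag(self, tag):
        if tag == "strong":
            self.in_strong = False

    def handle_data(self, data):
        if self.in_strong:
            self.chunks[-1] += data


# Sample markup modelled on the snippet in the question.
html = "<br /><strong>Date: 06/12/2010</strong> <br />"

parser = StrongTextParser()
parser.feed(html)
parser.close()

# Keep only entries that start with the "Date:" label, then strip the label off.
dates = [c.split(":", 1)[1].strip() for c in parser.chunks if c.startswith("Date:")]
print(dates)
```

A parser copes with attributes and odd whitespace inside the tags, which a hand-written regex would have to anticipate one case at a time.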

This is a good place to test out your Python regexes.

edited Dec 16 '10 at 16:19

answered Dec 16 '10 at 16:11


1

The `?P<date>` gives the group a name (`date`). It's not strictly necessary; you could leave out the `?P<date>` part, but then `match.groupdict()` would return an empty dictionary. Look for `?P<` on http://docs.python.org/library/re.html – nmichaels Dec 16 '10 at 16:44
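A short sketch of what the name changes in practice, assuming the sample line from the question as `text`:

```python
import re

text = "<br /><strong>Date: 06/12/2010</strong> <br />"

# Named group: the captured text is available under the key 'date'.
named = re.search(r"<strong>Date:(?P<date>.*?)</strong>", text)
print(named.groupdict())    # dictionary keyed by group name
print(named.group("date"))  # same text, looked up by name
print(named.group(1))       # a named group still has a number as well

# Plain group: same match, but there is no name to look up.
plain = re.search(r"<strong>Date:(.*?)</strong>", text)
print(plain.groupdict())    # empty dictionary
print(plain.group(1))
```

Note that the captured text keeps the space after `Date:`; calling `.strip()` on it is one way to drop it.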

Rod · Accepted Answer · 2010-12-16T16:28:29.703

1

import re

text = "<br /><strong>Date: 06/12/2010</strong> <br />"
m = re.search(r"<strong>(Date:.*?)</strong>", text)
print(m.group(1))

Output

Date: 06/12/2010
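The `?` after `.*` matters as soon as the page contains more than one `</strong>`. Here is a small sketch of the difference; the second `<strong>` element ("Venue") is invented for the example and is not from the question.

```python
import re

# Hypothetical page fragment with two <strong> elements.
text = "<strong>Date: 06/12/2010</strong> <br /><strong>Venue: Town Hall</strong>"

# Greedy: .* runs on to the LAST </strong> in the text.
greedy = re.search(r"<strong>(Date:.*)</strong>", text)

# Non-greedy: .*? stops at the FIRST </strong> after "Date:".
lazy = re.search(r"<strong>(Date:.*?)</strong>", text)

print(greedy.group(1))
print(lazy.group(1))
```

The greedy version swallows the intervening tags and the second element, while the non-greedy one returns only the date line.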

edited Dec 16 '10 at 16:28

answered Dec 16 '10 at 16:11


Yet another one gets bitten by greediness... this will give you a really large group spanning everything from the first `Date:` to the last `</strong>`. – Dec 16 '10 at 16:13
@nmichaels. I just did while you were commenting. :) – Rod Dec 16 '10 at 16:16
Thanks hugely - that worked for me. Now trying to work out what the `.group(1)` bit actually means so I can assign the results to a dictionary. Thanks again! – Paul Bradshaw Dec 16 '10 at 16:35
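On the `.group(1)` question: `group(0)` is the whole match and `group(1)` is whatever the first pair of parentheses captured. The sketch below shows that, then one possible way to collect "Label: value" pairs into a dictionary. The "Venue" line and the two-group pattern are assumptions for illustration, not something from the page in the question.

```python
import re

# Made-up fragment; only the "Date" line comes from the question.
text = (
    "<br /><strong>Date: 06/12/2010</strong> <br />"
    "<br /><strong>Venue: Town Hall</strong> <br />"
)

m = re.search(r"<strong>(Date:.*?)</strong>", text)
print(m.group(0))  # the whole match, tags included
print(m.group(1))  # only what the first pair of parentheses captured

# Two groups: one for the label, one for the value.
# With two groups, findall returns a list of (label, value) tuples.
pairs = re.findall(r"<strong>\s*([^:<]+):\s*(.*?)\s*</strong>", text)
info = dict(pairs)
print(info)
```

If a label can repeat on the page, `dict(pairs)` keeps only the last value for it, so a list of tuples may be the safer structure in that case.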
