1

I'm looking to use Python to pull a regular string of text from a webpage - the source code runs like this:

<br /><strong>Date: 06/12/2010</strong> <br />

It always begins

<strong>Date: 

& ends

</strong>

I've already scraped the text of the webpage and just want to pull the date and similarly structured information. Any suggestions how to do this? (Sorry this is such a newbie question!)

  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Katriel Dec 16 '10 at 16:15

2 Answers

3

You can use a regular expression:

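import re
# `text` is the page text you have already scraped.
# .*? is non-greedy; \s* skips the space after "Date:".
# Pass re.DOTALL as a second argument if a match can span several lines.
pattern = re.compile(r'<strong>Date:\s*(?P<date>.*?)</strong>')
# Then use it with
pattern.findall(text) # Returns a list of all matched dates
# or
match = pattern.search(text) # grabs the first match, or None if there is none
if match:
    match.groupdict() # gives a dictionary with key 'date'
    # or
    match.groups()[0] # gives you just the text of the first group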

Or try parsing the page with Beautiful Soup.

This is a good place to test out your Python regexes.

nmichaels
  • 1
    It gives the group a name (date.) It's not strictly necessary; you could leave out `?P`, but then `match.groupdict()` wouldn't work. Look for `?P<` on http://docs.python.org/library/re.html – nmichaels Dec 16 '10 at 16:44
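To make the difference between a named and an unnamed group concrete, here is a minimal sketch using the sample markup from the question. The variable names are illustrative only, and the `\s*` after `Date:` is an added assumption so the captured text has no leading space.

```python
import re

# Sample markup taken from the question.
text = "<br /><strong>Date: 06/12/2010</strong> <br />"

# Same pattern twice: once with a named group, once with a plain group.
named = re.search(r"<strong>Date:\s*(?P<date>.*?)</strong>", text)
unnamed = re.search(r"<strong>Date:\s*(.*?)</strong>", text)

named_dict = named.groupdict()      # keyed by the group name 'date'
unnamed_dict = unnamed.groupdict()  # empty: there are no named groups
by_name = named.group("date")       # look the group up by name
by_index = unnamed.group(1)         # look the group up by position
```

Both lookups return the same text; the name only matters if you want `groupdict()` or prefer `group("date")` over `group(1)`.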
1
import re

text = "<br /><strong>Date: 06/12/2010</strong> <br />"
# .*? is non-greedy, so the match stops at the first </strong>
m = re.search(r"<strong>(Date:.*?)</strong>", text)
print(m.group(1))

Output

Date: 06/12/2010
Rod
  • Yet another one gets bitten by greediness... this will give you a really large group spanning everything from the first `Date:` to the last `</strong>`. –  Dec 16 '10 at 16:13
  • @nmichaels. I just did while you were commenting. :) – Rod Dec 16 '10 at 16:16
  • Thanks hugely - that worked for me. Now trying to work out what the `.group(1)` bit actually means so I can assign the results to a dictionary. Thanks again! – Paul Bradshaw Dec 16 '10 at 16:35
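For anyone else wondering about the two points raised in these comments, the following is a rough sketch rather than a definitive answer. `group(0)` is the whole match and `group(1)` is whatever the first pair of parentheses captured. The second `Venue` field is a made-up example of "similarly structured information", and the `fields` dictionary is just one possible way to collect the results.

```python
import re

# Hypothetical page text: the Date line from the question plus an
# invented Venue line with the same structure.
text = ("<br /><strong>Date: 06/12/2010</strong> <br />"
        "<br /><strong>Venue: Town Hall</strong> <br />")

# Greedy .* runs on to the LAST </strong>; lazy .*? stops at the first.
greedy = re.search(r"<strong>(.*)</strong>", text).group(1)
lazy = re.search(r"<strong>(.*?)</strong>", text).group(1)

# Two groups: group(1) is the label before the colon,
# group(2) is the value after it.
pattern = re.compile(r"<strong>(\w+):\s*(.*?)</strong>")
fields = {m.group(1): m.group(2) for m in pattern.finditer(text)}
```

With this input, `lazy` holds only the Date line's contents, `greedy` swallows both fields and the markup between them, and `fields` maps each label to its value.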