
How can I find all spans with a class of 'blue' that contain text in the format:

04/18/13 7:29pm

which could therefore be:

04/18/13 7:29pm

or:

Posted on 04/18/13 7:29pm

This is the logic I have so far:

new_content = original_content.find_all('span', {'class' : 'blue'}) # using beautiful soup's find_all
pattern = re.compile('<span class=\"blue\">[data in the format 04/18/13 7:29pm]</span>') # using re
for _ in new_content:
    result = re.findall(pattern, _)
    print result

I've been referring to https://stackoverflow.com/a/7732827 and https://stackoverflow.com/a/12229134 to try to figure out a way to do this, but the above is all I have so far.

edit:

To clarify the scenario, there are spans with:

<span class="blue">here is a lot of text that i don't need</span>

and

<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>

Note that I only need 04/18/13 7:29pm, not the rest of the content.

edit 2:

I also tried:

pattern = re.compile('<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')
for _ in new_content:
    result = re.findall(pattern, _)
    print result

and got the error:

'TypeError: expected string or buffer'
user1063287

3 Answers

import re
from bs4 import BeautifulSoup

html_doc = """
<html>
<body>
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="blue">04/19/13 7:30pm</span>
<span class="blue">Posted on 04/20/13 10:31pm</span>
</body>
</html>
"""

# parse the html
soup = BeautifulSoup(html_doc, 'html.parser')

# find a list of all span elements
spans = soup.find_all('span', {'class' : 'blue'})

# create a list of lines corresponding to element texts
lines = [span.get_text() for span in spans]

# collect the dates from the list of lines using regex matching groups
found_dates = []
for line in lines:
    # [ap] is a character class; a '|' inside the brackets would match a literal pipe
    m = re.search(r'(\d{2}/\d{2}/\d{2} \d+:\d+[ap]m)', line)
    if m:
        found_dates.append(m.group(1))

# print the dates we collected
for date in found_dates:
    print(date)

output:

04/18/13 7:29pm
04/19/13 7:30pm
04/20/13 10:31pm
Corey Goldberg
  • i could successfully run exact code above, but it was not working in my implementation. i thought it might be because there is a `&nbsp;` between date and time in the original source code eg `04/18/13&nbsp;7:29pm`. for reference, i added `.replace("&nbsp;"," ")` to the original `'urlopen read object'` and it worked. thank you very much (to all responders!). – user1063287 Apr 27 '13 at 06:37
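
As an alternative to replacing the entity in the raw page source, the regex itself can tolerate a non-breaking space between date and time. The following is only a sketch using the standard library: `extract_date` and `DATE_RE` are hypothetical names, and `html.unescape` stands in for the entity decoding that an HTML parser would normally perform.

```python
import html
import re

# In Python 3, \s in a str pattern also matches the non-breaking space (U+00A0),
# so the separator between date and time may be any whitespace character.
DATE_RE = re.compile(r'\d{2}/\d{2}/\d{2}\s+\d{1,2}:\d{2}[ap]m')

def extract_date(text):
    """Return the first date/time in text with its separator normalised to one space, or None."""
    # Decode entities such as &nbsp; into the characters they stand for.
    m = DATE_RE.search(html.unescape(text))
    if m is None:
        return None
    # str.split() with no arguments splits on any whitespace, including U+00A0.
    return ' '.join(m.group(0).split())

print(extract_date('Posted on 04/18/13&nbsp;7:29pm'))
```

Normalising the separator afterwards keeps the returned strings uniform regardless of which whitespace character appeared in the page.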

This is a flexible regex that you can use:

"(\d\d?/\d\d?/\d\d\d?\d?\s*\d\d?:\d\d[aApP][mM])"

Example:

>>> import re
>>> from bs4 import BeautifulSoup
>>> html = """
<html>
<body>
<span class="blue">here is a lot of text that i don't need</span>
<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>
<span class="blue">04/19/13 7:30pm</span>
<span class="blue">04/18/13 7:29pm</span>
<span class="blue">Posted on 15/18/2013 10:00AM</span>
<span class="blue">Posted on 04/20/13 10:31pm</span>
<span class="blue">Posted on 4/1/2013 17:09aM</span>
</body>
</html>
"""
>>> soup = BeautifulSoup(html, 'html.parser')
>>> lines = [i.get_text() for i in soup.find_all('span', {'class' : 'blue'})]
>>> ok = [m.group(1)
      for line in lines
        for m in (re.search(r'(\d\d?/\d\d?/\d\d\d?\d?\s*\d\d?:\d\d[aApP][mM])', line),)
          if m]
>>> ok
[u'04/18/13 7:29pm', u'04/19/13 7:30pm', u'04/18/13 7:29pm', u'15/18/2013 10:00AM', u'04/20/13 10:31pm', u'4/1/2013 17:09aM']
>>> for i in ok:
    print i

04/18/13 7:29pm
04/19/13 7:30pm
04/18/13 7:29pm
15/18/2013 10:00AM
04/20/13 10:31pm
4/1/2013 17:09aM
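
Note that the regex only checks the shape of the text, so values such as 15/18/2013 or 17:09aM are accepted even though they are not real dates or 12-hour times. If that matters, the matched strings could be validated afterwards. The following is a rough sketch, not part of the answer above: `parse_timestamp` is a hypothetical helper, it assumes month/day/year order, and it assumes a locale in which `%p` recognises am/pm.

```python
from datetime import datetime

def parse_timestamp(text):
    """Return a datetime for a matched string, or None if it is not a valid date/time."""
    # Try a two-digit year first, then a four-digit year.
    for fmt in ('%m/%d/%y %I:%M%p', '%m/%d/%Y %I:%M%p'):
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    return None

for candidate in ['04/18/13 7:29pm', '15/18/2013 10:00AM', '4/1/2013 17:09aM']:
    print(candidate, '->', parse_timestamp(candidate))
```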
pradyunsg

This pattern seems to satisfy what you're looking for:

>>> pattern = re.compile(r'<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')
>>> pattern.match('<span class="blue">here is a lot of text that i dont need</span>')
>>> pattern.match('<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>').groups()
('04/18/13 7:29pm',)
Nolen Royalty
  • i don't know how to implement this, i posted the code i attempted based on your suggestion into original post (see edit 2). – user1063287 Apr 27 '13 at 05:58
  • @user1063287 try changing your third line to `result = pattern.match(_).groups()`. `re.findall` expects a string (like the string that you use earlier when you call `re.compile`) and instead you're giving it an already compiled regex. Essentially you're trying to compile your pattern twice. – Nolen Royalty Apr 27 '13 at 05:59
  • i get `'TypeError: expected string or buffer'` – user1063287 Apr 27 '13 at 06:01
  • It sounds like `_` isn't a string yet, you're gonna need to extract the actual string from your `_` variable before you can use a regex on it. I'd assume you can call something like `_.string`, try some print statements like `print _` and `print dir(_)` in order to figure out what kind of object you're working with right now. – Nolen Royalty Apr 27 '13 at 06:04
  • @user1063287 Corey's answer gives you a much more comprehensive explanation of how to do this, the method you needed to call on `_` was `get_text()`. But he provides a much more complete answer :) – Nolen Royalty Apr 27 '13 at 06:08
  • i will look at that answer now, in answer to your suggestions though: `print type(_)` gives ``, `print _` gives `Text and Text`, `print dir(_)` gives a list of lots of things. if i add `_ = _.string` above `result = pattern.match(_).groups()` and try and print `_`, i get `AttributeError: 'NoneType' object has no attribute 'groups'`. – user1063287 Apr 27 '13 at 06:14
  • The `AttributeError` you're getting is from when the regex doesn't match a string, so it returns `None`. This causes the code to call `None.groups()` which doesn't exist. Corey's code accounts for this with his line `if m:` which is why I directed you to his code. Hope this helps! – Nolen Royalty Apr 27 '13 at 06:16
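
The `None` check described in the last comment can be sketched as follows. This is only an illustration using the pattern from this answer on plain strings; `first_date` is a hypothetical helper name, not something from the original posts.

```python
import re

pattern = re.compile(r'<span class="blue">.*?(\d\d/\d\d/\d\d \d\d?:\d\d\w\w)</span>')

def first_date(markup):
    """Return the captured date/time, or None when the markup does not match."""
    m = pattern.match(markup)
    # pattern.match returns None on failure, so check before reading the group.
    return m.group(1) if m else None

samples = [
    '<span class="blue">here is a lot of text that i dont need</span>',
    '<span class="blue">this is the span i need because it contains 04/18/13 7:29pm</span>',
]
for markup in samples:
    print(first_date(markup))
```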