</span>
                    <div class="clearB paddingT5px"></div>
                    <small>
                        10/12/2015 5:49:00 PM -  Seeking Alpha
                    </small>
                    <div class="clearB paddingT10px"></div>

Suppose I have the source code of a web page, part of which looks like the snippet above. I am trying to get the text between <small> and </small>. The page contains many such elements, and I want to extract the text from every one of them.

I am trying to use a regular expression that looks like this:

regex = '<small>(.+?)</small>'
datestamp = re.compile(regex)
urls = re.findall(datestamp, htmltext)

This returns only a blank space. Please advise me on this.

Martin Evans
M PAUL

2 Answers

Here are two ways you could approach this:

Firstly, using a regular expression (not recommended for parsing HTML):

import re

html = """</span>
    <div class="clearB paddingT5px"></div>
    <small>
        10/12/2015 5:49:00 PM -  Seeking Alpha
    </small>
    <div class="clearB paddingT10px"></div>"""

# re.S lets '.' match newlines, so content spanning several lines is captured too
for item in re.findall(r'<small>\s*(.*?)\s*</small>', html, re.I | re.S):
    print('"{}"'.format(item))

Secondly, using a parser such as BeautifulSoup to handle the HTML for you:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
for item in soup.find_all("small"):
    print('"{}"'.format(item.text.strip()))

Both approaches give the following output:

"10/12/2015 5:49:00 PM -  Seeking Alpha"
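As a minimal sketch of why the pattern in the question misses this element: by default `.` does not match a newline, and here the text sits on its own line between the tags. The sample markup below is a shortened copy of the question's snippet, and the variable names are only illustrative.

```python
import re

# Shortened copy of the markup from the question (illustrative sample)
html = """<div class="clearB paddingT5px"></div>
<small>
    10/12/2015 5:49:00 PM -  Seeking Alpha
</small>
<div class="clearB paddingT10px"></div>"""

pattern = r'<small>(.+?)</small>'

# Without re.DOTALL, '.' stops at the newline right after <small>,
# so this multi-line element is never matched.
without_flag = re.findall(pattern, html)

# With re.DOTALL, '.' also matches newlines; strip() trims the
# whitespace captured around the text.
with_flag = [m.strip() for m in re.findall(pattern, html, re.DOTALL)]

print(without_flag)
print(with_flag)
```

The regular expression is still fragile against attributes or nested tags, which is why a parser is usually the safer choice.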
Martin Evans

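You can use xml.etree here. Fetch the page with urllib2, parse the response, and look up whatever tag you need. Note that this only works when the page is well-formed XML (for example XHTML); ordinary HTML will often fail to parse. Like so: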

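import urllib2  # Python 2; on Python 3 use urllib.request instead
from xml.etree import ElementTree

url = whateverwebpageyouarelookingin  # placeholder: the page URL, as a string
request = urllib2.Request(url, headers={"Accept": "application/xml"})
u = urllib2.urlopen(request)
tree = ElementTree.parse(u)  # raises ParseError if the page is not well-formed XML
rootElem = tree.getroot()
# ".//small" searches all descendants; plain "small" matches only direct children of the root
yourdata = rootElem.findall(".//small")
print([elem.text for elem in yourdata])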
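A small offline sketch of the lookup step may help. It assumes a well-formed fragment (the `xml_text` below is made up for illustration, not taken from the page) and shows the difference between searching for `"small"` and `".//small"` with ElementTree.

```python
from xml.etree import ElementTree

# Hypothetical well-formed fragment; real pages are often not valid XML
xml_text = """<div>
  <span>1</span>
  <div>
    <small>
      10/12/2015 5:49:00 PM -  Seeking Alpha
    </small>
  </div>
</div>"""

root = ElementTree.fromstring(xml_text)

# "small" matches only direct children of the root element
direct = root.findall("small")

# ".//small" matches <small> elements at any depth below the root
nested = root.findall(".//small")

dates = [el.text.strip() for el in nested]
print(len(direct), dates)
```

Here the direct search finds nothing because the `<small>` element is nested one level down, while the `.//` form finds it.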
Amazingred