  1. I need to parse an HTML page and extract all the URLs meeting my requirement.
  2. Then I need to parse each of the extracted URLs and, if the page title matches something, save the data I want to multiple files based on their names. I have done part 1 in the following way:

    # extract hrefs from "topline" anchors with a regex over the raw page source
    pattern = re.compile(r'''class="topline"><A href="(.*?)"''')
    da = pattern.findall(web_page)
    # pad every matched link to the width of the longest one
    col_width = max(len(word) for row in da for word in row)
    for row in da:
        if "some string" in row.upper():
            bb = row.ljust(col_width)
            print >> links, bb  # Python 2 syntax; links is an open file object
    

I'd truly appreciate any help. Thank you.

user3783999
    Use BeautifulSoup or any other library for parsing HTML, do not use regex. http://stackoverflow.com/a/1732454/969365 – simon Jul 14 '14 at 18:15

1 Answer


First of all, do not parse HTML with regex. You've tagged the question with BeautifulSoup, but you are still using regular expressions here.

Here's how you can get the links, follow them and check the title:

from urllib2 import urlopen
from bs4 import BeautifulSoup

URL = "url here"

soup = BeautifulSoup(urlopen(URL))
links = soup.select('.topline > a')
for a in links:
    link = a.get('href')
    if link:
        # follow link
        link_soup = BeautifulSoup(urlopen(link))
        title = link_soup.find('title')
        # check title

The .topline > a CSS selector finds every tag with the topline class and selects the a tag directly beneath it.
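To cover the rest of your part 2 (saving each matched page to a file based on its name), you can extend the loop above. Here is a minimal sketch, assuming the file name should come from the page title; the helpers title_matches and safe_filename are hypothetical names, and an inline HTML string stands in for urlopen(link) so it runs without network access:

```python
import re
from bs4 import BeautifulSoup

def title_matches(html, wanted):
    # Return the page title if it contains `wanted`, else None.
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("title")
    if title_tag and wanted in title_tag.get_text():
        return title_tag.get_text().strip()
    return None

def safe_filename(title):
    # Turn a page title into a file-system-safe name.
    return re.sub(r"[^\w.-]+", "_", title).strip("_") + ".html"

# Stand-in for link_soup's page; in the answer's loop this would be urlopen(link).read()
page = "<html><head><title>Report: some string</title></head><body>data</body></html>"
title = title_matches(page, "some string")
if title:
    with open(safe_filename(title), "w") as f:
        f.write(page)
```

Inside the answer's for loop you would call these on each fetched page instead of the inline string, skipping pages where title_matches returns None.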

Hope that helps.

alecxe