Your question is a bit vague. It sounds like you'd like to do something with the `urllib2` and `BeautifulSoup` modules.
Fetch the HTML from the base URL with `urllib2`'s functions, parse it with `BeautifulSoup`, and use the target (the value of the `href` attribute) of the (first TXT?) anchor tag in the table to open another connection and pull those contents. Then open your local file (or subprocess) and feed the contents of the second fetch to it.
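Here's a minimal sketch of that flow. The URL, the output filename, and the table lookup are all placeholders you'd adapt to the actual page structure:

```python
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x

# Placeholder URL -- substitute the actual page you're scraping.
base_url = 'http://example.com/listing.html'
html = urllib2.urlopen(base_url).read()
soup = BeautifulSoup(html)

# Grab the first anchor inside the first table; adjust this search to
# whatever actually identifies the TXT link on your page.
anchor = soup.find('table').find('a')

# The href may be relative, so resolve it against the base URL.
target_url = urlparse.urljoin(base_url, anchor['href'])

# Second fetch: pull the linked contents and write them to a local file.
contents = urllib2.urlopen(target_url).read()
with open('output.txt', 'wb') as f:
    f.write(contents)
```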
The toughest part of using BeautifulSoup is finding the characteristics which uniquely identify the part of the content that you want to extract. Modern HTML is pretty ugly and tends to have lots of extraneous garbage embedded in it by the various tools and libraries which were used to generate it. (One tip: the word "class" is a Python reserved keyword as well as a common attribute in HTML. Thus you'll find it easiest to pass "class" attribute/pattern pairs to BeautifulSoup functions by wrapping them in a dictionary: `{'class': some_pattern}` rather than in the more common `keyword=pattern` form that's used for most other arguments.)
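For example, against the BeautifulSoup 3 API (the tag and class names here are made up):

```python
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup('<div class="result">hit</div><div class="ad">miss</div>')

# 'class' is a Python keyword, so it can't be passed as class=...;
# wrap it in a dictionary (the attrs argument) instead.
results = soup.findAll('div', {'class': 'result'})

# Ordinary attributes can use the keyword form, e.g.:
# soup.findAll('a', href='http://example.com/')
```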
To handle the JavaScript you might want to read:
What's a good tool to screen-scrape with Javascript support?
It sounds like your best bet, currently, may be to set up the Java-based HTMLUnit package to serve as a gateway, then write your Python code to connect to and control that. You might also try Selenium to control a real browser session and extract information from it via inter-process communication mechanisms.
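If you go the Selenium route, the basic loop looks something like this (a sketch using the Selenium WebDriver Python bindings; the URL is a placeholder):

```python
from selenium import webdriver

# Drive a real Firefox session; the browser executes the page's JavaScript.
driver = webdriver.Firefox()
try:
    driver.get('http://example.com/listing.html')
    # page_source is the DOM *after* the JavaScript has run, so you can
    # feed it straight into BeautifulSoup for the same parsing as above.
    html = driver.page_source
finally:
    driver.quit()
```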