Your question is a bit vague. It sounds like you'd like to do something with the `urllib2` and `BeautifulSoup` modules.
Fetch the HTML from the base URL with `urllib2`'s functions, parse it with `BeautifulSoup`, and use the target (the value of the `href` attribute) of the (first TXT?) anchor tag in the table to open another connection and pull those contents. Then open your local file (or subprocess) and feed the contents of the second fetch to it.
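Here's a minimal sketch of that flow. The URL, the output filename, and the table lookup are all placeholders you'd adapt to the actual page structure:

```python
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x

# Placeholder URL -- substitute the actual page you're scraping.
base_url = 'http://example.com/listing.html'
html = urllib2.urlopen(base_url).read()
soup = BeautifulSoup(html)

# Grab the first anchor inside the first table; adjust this search to
# whatever actually identifies the TXT link on your page.
anchor = soup.find('table').find('a')

# The href may be relative, so resolve it against the base URL.
target_url = urlparse.urljoin(base_url, anchor['href'])

# Second fetch: pull the linked contents and write them to a local file.
contents = urllib2.urlopen(target_url).read()
with open('output.txt', 'wb') as f:
    f.write(contents)
```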
The toughest part of using BeautifulSoup is finding the characteristics which uniquely identify the part of the content that you want to extract. Modern HTML is pretty ugly and tends to have lots of extraneous garbage embedded in it by the various tools and libraries which were used to generate it. (One tip: the word "class" is a Python reserved keyword as well as a common attribute in HTML. Thus you'll find it easiest to pass "class" attribute/pattern pairs to BeautifulSoup functions by wrapping them in a dictionary: `{'class': some_pattern}` rather than in the more common `keyword=pattern` form that's used for most other arguments.)
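For example, against the BeautifulSoup 3 API (the tag and class names here are made up):

```python
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup('<div class="result">hit</div><div class="ad">miss</div>')

# 'class' is a Python keyword, so it can't be passed as class=...;
# wrap it in a dictionary (the attrs argument) instead.
results = soup.findAll('div', {'class': 'result'})

# Ordinary attributes can use the keyword form, e.g.:
# soup.findAll('a', href='http://example.com/')
```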
To handle the JavaScript you might want to read:
What's a good tool to screen-scrape with Javascript support?
It sounds like your best bet, currently, may be to set up the Java-based HTMLUnit package to serve as a gateway, then write your Python code to connect to and control that. You might also try Selenium to control a real browser session and extract information from it via inter-process communication mechanisms.
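If you go the Selenium route, the basic loop looks something like this (a sketch using the Selenium WebDriver Python bindings; the URL is a placeholder):

```python
from selenium import webdriver

# Drive a real Firefox session; the browser executes the page's JavaScript.
driver = webdriver.Firefox()
try:
    driver.get('http://example.com/listing.html')
    # page_source is the DOM *after* the JavaScript has run, so you can
    # feed it straight into BeautifulSoup for the same parsing as above.
    html = driver.page_source
finally:
    driver.quit()
```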