
I am trying to download all the files from this website for backup and mirroring; however, I don't know how to parse the JavaScript links correctly.

I need to organize all the downloads in the same way, in named folders. For example, for the first one I would have a folder named "DAP-1150", and inside that a folder named "DAP-1150 A1 FW v1.10" containing the file "DAP1150A1_FW110b04_FOSS.zip", and so on for each file. I tried using BeautifulSoup in Python, but it didn't seem to be able to handle the ASP links properly.
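Per file, what I'm aiming for is roughly this (a sketch; save_firmware and file_url are placeholders, since working out the actual download URL is the part I'm stuck on):

import os
import urllib2

def save_firmware(model, fw_title, file_url, file_name):
    # e.g. model = "DAP-1150", fw_title = "DAP-1150 A1 FW v1.10",
    #      file_name = "DAP1150A1_FW110b04_FOSS.zip"
    folder = os.path.join(model, fw_title)
    if not os.path.exists(folder):
        os.makedirs(folder)              # create "DAP-1150/DAP-1150 A1 FW v1.10"
    with open(os.path.join(folder, file_name), 'wb') as out:
        out.write(urllib2.urlopen(file_url).read())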

jhilliar
  • I think Selenium is probably overkill for this. I noticed that once you've clicked on a link, it does a POST submit (because the resultant page cannot be refreshed without asking the user). Thus, work out what clicking the link does - it probably inserts a value into a form and submits it. All you need to do in your scraping system is to emulate that, using the scraped links to work out what inputs you need. – halfer Oct 22 '13 at 17:31
  • Yeah, it looks to me like Scrapy is the way to go; I need to create the folder structure for the downloads and generate a full list of downloads and paths that I can queue up and update when there are changes. – jhilliar Oct 22 '13 at 18:40
  • I've been trying a number of things but am still unable to get anything working; this website just seems weird. I think I need some sort of scraper that hooks into jQuery properly, but I don't have any idea how to do that. I can trace all the calls that get made in Chrome using the Timeline functionality, but I don't know how you would adapt that into Scrapy or something similar. – jhilliar Oct 23 '13 at 11:21
  • No, you don't need to hook into jQuery, or use JavaScript, at all. See [my answer here](http://stackoverflow.com/a/19333797/472495). – halfer Oct 23 '13 at 11:26
  • I think I just now got a usable XPath in Scrapy using a Chrome add-on: "//strong/a" gives all the links, at least. – jhilliar Oct 23 '13 at 11:31

2 Answers


When you struggle with JavaScript links, you can give Selenium a try: http://selenium-python.readthedocs.org/en/latest/getting-started.html

from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get("http://www.python.org")
time.sleep(3)   # Give your Selenium some time to load the page
link_elements = driver.find_elements_by_tag_name('a')
links = [link.get_attribute('href') for link in link_elements]   # collect the href of every <a> on the page

You can use the links and pass them to urllib2 to download them accordingly. If you need more than a script, I can recommend a combination of Scrapy and Selenium: selenium with scrapy for dynamic page
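For example, a rough download loop might look like this (a sketch, assuming the scraped hrefs turn out to be direct file URLs rather than javascript: links):

import os
import urllib2
import urlparse

for url in links:
    if not url or not url.lower().endswith('.zip'):
        continue  # skip javascript: pseudo-links and anything that isn't a file
    file_name = os.path.basename(urlparse.urlparse(url).path)
    with open(file_name, 'wb') as out:
        out.write(urllib2.urlopen(url).read())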

Jon

Here's what it is doing. I just used the standard Network inspector in Firefox to snapshot the POST operation. Bear in mind, as in the other answer I pointed you to, that this is not a particularly well-written website - JS/POST should not have been used at all.

First of all, here's the JS - it's very simple:

function oMd(pModel_, sModel_) {
    obj = document.form1;
    obj.ModelCategory_.value = pModel_;
    obj.ModelSno_.value = sModel_;
    obj.Model_Sno.value = '';
    obj.ModelVer.value = '';
    obj.action = 'downloads2008detail.asp';
    obj.submit();
}

That writes to these fields:

<input type=hidden name=ModelCategory_ value=''>
<input type=hidden name=ModelSno_ value=''>

So, you just need a POST form, targeting this URL:

http://tsd.dlink.com.tw/downloads2008detail.asp

And here's an example set of data from FF's network analyser. There are only two items you need to change - grabbed from the JS link - and you can grab those with an ordinary scrape:

  • Enter=OK
  • ModelCategory=0
  • ModelSno=0
  • ModelCategory_=DAP
  • ModelSno_=1150
  • Model_Sno=
  • ModelVer=
  • sel_PageNo=1
  • OS=GPL

You'll probably find by experimentation that not all of them are necessary. I did try using GET for this, in the browser, but it looks like the target page insists upon POST.

Don't forget to leave a decent amount of time inside your scraper between clicks and submits, as each one represents a hit on the remote server; I suggest 5 seconds, emulating a human delay. If you do this too quickly - all too possible if you are on a good connection - the remote side may assume you are DoSing them, and might block your IP. Remember the motto of scraping: be a good robot!
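Put together in Python, the request might look something like this (a sketch using urllib2, not the site's own code; the field values are copied from the capture above, and the helper name is just for illustration):

import time
import urllib
import urllib2

def fetch_detail_page(model_category, model_sno):
    # Only ModelCategory_ and ModelSno_ vary per link; the rest is copied
    # from the network capture and may not all be required.
    data = urllib.urlencode({
        'Enter': 'OK',
        'ModelCategory': '0',
        'ModelSno': '0',
        'ModelCategory_': model_category,   # e.g. 'DAP'
        'ModelSno_': model_sno,             # e.g. '1150'
        'Model_Sno': '',
        'ModelVer': '',
        'sel_PageNo': '1',
        'OS': 'GPL',
    })
    # Passing data makes urllib2 issue a POST rather than a GET
    response = urllib2.urlopen('http://tsd.dlink.com.tw/downloads2008detail.asp', data)
    time.sleep(5)   # be a good robot: pause between hits on the remote server
    return response.read()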

halfer
  • Using hxs.select("//strong/a/@href").extract() from the Scrapy shell I seem to be able to get the information needed for a single page. However, the same path does not seem to work for scraping page numbers; for that I got "//tr[21]/td/table/tbody/tr/td/a", but it appears Scrapy doesn't interpret "tbody" at all. Is there another way to go about finding the correct XPath for Scrapy? – jhilliar Oct 23 '13 at 12:06
  • Just figured that out, looks like I can modify the page requests like this http://tsd.dlink.com.tw/downloads2008list.asp?t=1&OS=GPL&SourceType=download&PageNo=3 to increment the pages – jhilliar Oct 23 '13 at 12:36
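For reference, paging could then be driven by something like this (a sketch, assuming the query-string pattern from the last comment keeps working; the page count is a placeholder and would really need to be scraped from the pager):

base = 'http://tsd.dlink.com.tw/downloads2008list.asp?t=1&OS=GPL&SourceType=download'

# Build one URL per listing page by incrementing PageNo
page_urls = ['%s&PageNo=%d' % (base, page_no) for page_no in range(1, 11)]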