I am trying to retrieve the ad URLs for this website: http://www.appledaily.com

The ad URLs are loaded using JavaScript, so a standard CrawlSpider does not work. The ads also change each time you refresh the page.

I found a related question, and from what I gathered we first need to use Selenium to load the page in a browser and then use Scrapy to retrieve the URLs. I have some experience with Scrapy but none with Selenium. Can anyone show me, or point me to a resource on, how to write a script to do that?

Thank you very much!

EDIT:

I tried the following, but neither attempt opens the ad banner. Can anyone help?

from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://appledaily.com')

adBannerElement = driver.find_element_by_id('adHeaderTop') 
adBannerElement.click()

2nd try:

adBannerElement =driver.find_element_by_css_selector("div[@id='adHeaderTop']")
adBannerElement.click()
Onyi Lam
  • Check out this link, should help you get started - http://stackoverflow.com/questions/17975471/selenium-with-scrapy-for-dynamic-page – DRVaya Mar 13 '15 at 08:03
  • thanks, i looked through it but am still stuck. please see my edit – Onyi Lam Mar 15 '15 at 17:29

1 Answer

A CSS selector should not contain the @ symbol (that is XPath syntax). It should be div[id='adHeaderTop'], or more concisely div#adHeaderTop.
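
To illustrate where the @ form belongs, here is a minimal sketch using Python's standard-library ElementTree, which supports a limited XPath subset. The HTML fragment is made up for the example and is not the real page markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for the real page.
fragment = (
    "<body>"
    "<div id='adHeaderTop'><a href='#'>ad</a></div>"
    "<div id='other'/>"
    "</body>"
)
root = ET.fromstring(fragment)

# XPath-style predicate: the attribute is written as @id.
matches = root.findall(".//div[@id='adHeaderTop']")
print([m.get("id") for m in matches])
```

The `[@id='...']` predicate is accepted here because the expression is XPath. In a CSS selector the same condition is written without the @, as div[id='adHeaderTop'] or div#adHeaderTop.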

Actually, after inspecting the site and the action you are trying to perform, I think the noscript tag is what should interest you. Get the HTML source of this node, parse the href attribute out of it, and request that URL. This is equivalent to clicking the banner.

  <noscript>
   "<a href="http://adclick.g.doubleclick.net/aclk%253Fsa%...</a>"
  </noscript>

(This is not the complete node information, just inspect the banner in Chrome and you will find this tag).
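
Since the question is in Python, here is a minimal sketch of the "parse the href attribute" step using only the standard library. It assumes you already have the inner HTML of the noscript node as a string. The names `FirstHrefParser` and `extract_ad_url` and the sample markup with its example.com URLs are hypothetical, chosen for illustration only.

```python
from html.parser import HTMLParser


class FirstHrefParser(HTMLParser):
    """Remembers the href of the first <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")


def extract_ad_url(noscript_html):
    """Return the first link target in the markup, or None if there is none."""
    parser = FirstHrefParser()
    parser.feed(noscript_html)
    return parser.href


# Hypothetical stand-in for the real noscript content.
sample = (
    '<a href="http://example.com/click?id=1">'
    '<img src="http://example.com/banner.png"></a>'
)
print(extract_ad_url(sample))
```

Using an HTML parser rather than splitting on quote characters keeps the extraction working if the attribute order inside the tag changes.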

EDIT: Here is a working snippet that gives you the URL without clicking on the ad banner, by reading it from the noscript tag mentioned above.

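    import java.util.List;
    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.firefox.FirefoxDriver;

    // The statements below go inside a method, e.g. main().
    WebDriver driver = new FirefoxDriver();
    driver.navigate().to("http://www.appledaily.com");

    // findElements returns an empty list when nothing matches,
    // whereas findElement would throw NoSuchElementException.
    List<WebElement> hidden = driver.findElements(By.cssSelector("div#adHeaderTop_ad_container noscript"));
    if (!hidden.isEmpty()) {
        String innerHTML = hidden.get(0).getAttribute("innerHTML");
        String adURL = innerHTML.split("\"")[1]; // first quoted value, i.e. the href
        System.out.println("** " + adURL); // URL opened when you click on the ad
    } else {
        System.out.println("<noscript> element not found...");
    }
    driver.quit(); // close the browser when done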

Although this is written in Java, the page source is the same whichever language binding you use, so the same selector and parsing apply from Python.

DRVaya