
I need to scrape a webpage and I normally use Scrapy. This time I need to follow some links that can only be opened through JavaScript, and they are nested inside some <ul> and <li> elements.

For example:

<ul class="level1">
  <li class="closed">  <!-- this becomes "expanded" when opened -->
    <a href="javascript:etc...
      <ul class="level2">
        <li class="closed">
          <ul class="level3">
            <li class="track">
              <a href="this_is_the_url_that_I_want">

Now, do I need something other than Scrapy (I see that Selenium is often suggested), or can I use an XmlLinkExtractor? Or can I somehow extract the URL inside "level3" with plain Scrapy code?
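For reference, if the "level3" links turn out to already be present in the downloaded HTML (i.e. the JavaScript only toggles the closed/expanded classes), then a plain-Scrapy sketch like the one below is roughly what I have in mind; the spider name and the selectors are just guesses based on the snippet above:

import scrapy

class TrackSpider(scrapy.Spider):
    # hypothetical name/start URL, only to illustrate the idea
    name = "tracks"
    start_urls = ["http://audio.sample/archive-project"]

    def parse(self, response):
        # if the nested <ul>/<li> tree is in the static HTML, the level3
        # hrefs can be read directly, without clicking anything
        hrefs = response.xpath(
            '//ul[@class="level1"]//ul[@class="level3"]'
            '/li[@class="track"]/a/@href').extract()
        for href in hrefs:
            yield {"track_url": response.urljoin(href)}

If the links are only injected by the JavaScript, I understand this won't work, which is why I'm also trying Selenium below.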

Thanks

EDIT: I'm trying to use Selenium, but I get:

  File "/usr/lib/pymodules/python2.7/scrapy/spiderloader.py", line 40, in load
    raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: '

I'm naming the spider, so I don't understand what I've done wrong.

import scrapy
from selenium import webdriver

class audioSpider(scrapy.Spider):
    name = "audio"
    # allowed_domains should hold domains only, not full URLs
    allowed_domains = ["audio.sample"]
    start_urls = ["http://audio.sample/archive-project"]

    def __init__(self, *args, **kwargs):
        super(audioSpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        # click the <a> inside the level1 <li> to expand level2
        el1 = self.driver.find_element_by_xpath(
            '//ul[@class="level1"]/li[@class]/a')
        el1.click()
        # click the <a> inside the level2 <li> to expand level3
        el2 = self.driver.find_element_by_xpath(
            '//*[@class="subNavContainer loaded"]//ul[@class="level2"]/li[@class]/a')
        el2.click()
        # this is the element whose href I actually want
        el3 = self.driver.find_element_by_xpath(
            '//*[@class="subNavContainer loaded"]//ul[@class="level3"]/li[@class="track"]/a')
        print(el3)
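Once that works, I want to pull the URL out of el3 rather than just print the element, and shut the browser down when the spider finishes. Something along these lines is what I'm assuming (get_attribute and the closed hook are my guesses at the clean way to do it):

        # read the target URL off the <a> element and hand it back to Scrapy
        yield {"track_url": el3.get_attribute('href')}

    def closed(self, reason):
        # quit the browser when the spider is done
        self.driver.quit()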
