I need to scrape a webpage, and I normally use Scrapy. I need to follow some links that can only be opened through JavaScript, and they are nested inside some <ul> and <li> elements.
For example:
<ul class="level1">
  <li class="closed">            <-- this becomes "expanded" when opened
    <a href="javascript:etc...
    <ul class="level2">
      <li class="closed">
        <ul class="level3">
          <li class="track">
            <a href="this_is_the_url_that_I_want">
Now, do I need something other than Scrapy (I see that Selenium is often suggested), or can I use an XmlLinkExtractor? Or can I somehow extract the URL inside "level3" directly with Scrapy?
Thanks
EDIT: I'm trying to use Selenium, but I get:

File "/usr/lib/pymodules/python2.7/scrapy/spiderloader.py", line 40, in load
    raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: '
I'm naming the spider, so I don't understand what I've done wrong.
import scrapy
from selenium import webdriver


class audioSpider(scrapy.Spider):
    name = "audio"
    allowed_domains = ["audio.sample"]  # domain only, no http:// scheme
    start_urls = ["http://audio.sample/archive-project"]

    def __init__(self, *args, **kwargs):
        super(audioSpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        # Click the level1 link to expand the menu.
        el1 = self.driver.find_element_by_xpath(
            '//ul[@class="level1"]/li[@class]/a')
        el1.click()
        # Drill down through level2 to the level3 track link.
        el2 = self.driver.find_element_by_xpath(
            '//*[@class="subNavContainer loaded"]/ul[@class="level2"]/li[@class]/a')
        el2.click()
        el3 = self.driver.find_element_by_xpath(
            '//*[@class="subNavContainer loaded"]//ul[@class="level3"]/li[@class="track"]/a')
        print el3.get_attribute("href")