How to yield fragment URLs in scrapy using Selenium?

Question

With my limited knowledge of web scraping, I've run into an issue that is quite complex for me. I'll try to explain it as best I can (and I'm open to suggestions or edits to my post).

I started using the web crawling framework Scrapy a long time ago for my web scraping, and it's still the one I use today. Recently I came across this website and found that Scrapy could not iterate over its pages, because the site uses URL fragments (#) to load the data for the next pages. I then posted a question about that problem (without yet knowing the underlying cause): my post
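
To make the limitation concrete: the part of a URL after `#` is handled by the browser and is not sent to the server in the HTTP request, so a plain HTTP client receives the same document for every "page". Below is a minimal sketch using only the standard library, under the assumption (consistent with how the code further down builds its URLs) that this site's fragment is base64-encoded JSON. The helper name `decode_fragment` is hypothetical.

```python
import base64
import json
from urllib.parse import urldefrag

# The start URL from the spider code in this question.
URL = "http://www.atraveo.com/es_es/islas_canarias#eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJhdGlvbiI6NywibWluUGVyc29ucyI6MX0sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ=="


def decode_fragment(url):
    """Split off the fragment and decode it as base64-encoded JSON."""
    base, fragment = urldefrag(url)
    state = json.loads(base64.b64decode(fragment).decode("utf-8"))
    return base, state


base, state = decode_fragment(URL)
print(base)                     # the only part a server would receive
print(state["config"]["page"])  # the page number lives in the fragment
```

Because the page number only exists inside the fragment, something has to execute the site's JavaScript (or reproduce the request it makes) before different pages can be retrieved.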

After that, I realized that Scrapy can't handle this without a JavaScript interpreter or browser emulation, and people suggested the Selenium library. I read as much as I could about it (e.g. example1, example2, example3 and example4). I also found this Stack Overflow post, which gives some hints about my issue.

So, finally, my main questions are:

1 - Is there any way to iterate over the pages of the website shown above and yield them, using Selenium together with Scrapy? So far, this is the code I'm using, but it doesn't work...

EDIT:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# The required imports...

def getBrowser():
    path_to_phantomjs = "/some_path/phantomjs-2.1.1-macosx/bin/phantomjs"
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
    "(KHTML, like Gecko) Chrome/15.0.87")
    browser = webdriver.PhantomJS(executable_path=path_to_phantomjs, desired_capabilities=dcap)

    return browser

class MySpider(Spider):
    name = "myspider"

    browser = getBrowser()

    def start_requests(self):
        the_url = "http://www.atraveo.com/es_es/islas_canarias#eyJkYXRhIjp7ImNvdW50cnlJZCI6IkVTIiwicmVnaW9uSWQiOiI5MjAiLCJkdXJhdGlvbiI6NywibWluUGVyc29ucyI6MX0sImNvbmZpZyI6eyJwYWdlIjoiMCJ9fQ=="

        yield scrapy.Request(url=the_url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        self.get_page_links()

    def get_page_links(self):
        """ This first part, goes through all available pages """

        for i in xrange(1, 3):  # 210
            new_data = {"data": {"countryId": "ES", "regionId": "920", "duration": 7, "minPersons": 1},
                    "config": {"page": str(i)}}
            json_data = json.dumps(new_data)
            new_url = "http://www.atraveo.com/es_es/islas_canarias#" + base64.b64encode(json_data)
            self.browser.get(new_url)
            print "\nThe new URL is -> ", new_url, "\n"
            content = self.browser.page_source
            self.get_item_links(content)

    def get_item_links(self, body=""):
        if body:
            """ This second part, goes through all available items """
            raw_links = re.findall(r'listclickable.+?>', body)
            links = []
            if raw_links:
                for raw_link in raw_links:
                    new_link = re.findall(r'data-link=\".+?\"', raw_link)[0]
                    new_link = new_link.replace("data-link=\"", "").replace("\"", "")
                    links.append(str(new_link))

                if links:
                    ids = self.get_ids(links)
                    for link in links:
                        current_id = self.get_single_id(link)
                        print "\nThe Link -> ", link
                        # If the line below is commented out, the code works; otherwise it doesn't
                        yield scrapy.Request(url=link, callback=self.parse_room, dont_filter=True)

    def get_ids(self, list1=[]):
        if list1:
            ids = []
            for elem in list1:
                raw_id = re.findall(r'/[0-9]+', elem)[0].replace("/", "")
                ids.append(raw_id)

            return ids

        else:
            return []

    def get_single_id(self, text=""):
        if text:
            raw_id = re.findall(r'/[0-9]+', text)[0].replace("/", "")
            return raw_id

        else:
            return ""

    def parse_room(self, response):
        # More scraping code...
        pass

So this is my main problem. I'm almost sure that what I'm doing isn't the best approach, which is why I'm asking my second question. And to avoid running into this kind of issue in the future, I'm asking my third question.
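
One likely explanation for the behaviour noted in the code comment (a reading of the code as posted, not a tested fix): adding `yield` turns `get_item_links` into a generator function, so calling it only creates a generator object and runs none of its body until something iterates over it. Since `get_page_links` and `parse` discard that object, no requests would ever reach Scrapy. Here is a minimal sketch in plain Python 3, with strings standing in for `scrapy.Request` objects; `parse_discarding` and `parse_delegating` are hypothetical names.

```python
def get_item_links(links):
    # Contains a yield, so this is a generator function.
    for link in links:
        yield "Request(%s)" % link


def parse_discarding(links):
    # Mirrors the posted code: the generator is created but never iterated,
    # so nothing is produced and the function returns None.
    get_item_links(links)


def parse_delegating(links):
    # `yield from` passes every item up to whoever iterates over this function.
    yield from get_item_links(links)


links = ["/room/1", "/room/2"]
print(parse_discarding(links))        # None
print(list(parse_delegating(links)))  # ['Request(/room/1)', 'Request(/room/2)']
```

In a spider, this would mean `parse` and `get_page_links` each delegating with `yield from` (or a `for` loop with `yield` on Python 2), since Scrapy only schedules the requests that a callback actually returns or yields.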

2 - If the answer to the first question is no, how else could I tackle this issue? I'm open to other approaches.

3 - Can anyone point me to pages where I can learn how to scrape websites that rely on JavaScript and Ajax? More and more websites nowadays use JavaScript and Ajax to load their content.

Many thanks in advance!

Possible duplicate of [Using Selenium + Scrapy](https://stackoverflow.com/questions/41571456/using-selenium-scrapy) — parik, Oct 06 '17 at 16:51

score 3 · Answer 1 · answered Oct 06 '17 at 12:59

Selenium is one of the best tools for scraping dynamic data. You can use Selenium with any web browser to fetch data that is loaded by scripts. It works just like clicking through the page in a browser. However, it is not my preferred option.

To get dynamic data, you can use the Scrapy + Splash combination: Scrapy fetches the static data, and Splash renders the dynamic content.
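
As a rough sketch of what wiring Splash into a Scrapy project involves: the `scrapy-splash` plugin is enabled through the project's `settings.py`. The values below are based on the plugin's README. The Splash address is an assumption (a Splash instance listening locally on port 8050), and whether Splash handles this particular site's URL fragment has not been verified here.

```python
# settings.py (fragment) for the scrapy-splash plugin

# Assumption: a Splash instance is running locally on port 8050.
SPLASH_URL = "http://localhost:8050"

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Makes duplicate filtering aware of the arguments passed to Splash.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```

A spider would then yield `scrapy_splash.SplashRequest(url, callback, args={"wait": 1})` in place of `scrapy.Request`, where `wait` gives the page's scripts some time to run before the rendered HTML is returned.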

Arun Augustine

Thank you so much for your answer. However, I've made some edits to my post to get more specific answers; please take a look! @Arun Augustine :) – wj127 Oct 09 '17 at 14:59

score 1 · Answer 2 · answered Oct 06 '17 at 11:08

Have you looked into BeautifulSoup? It's a very popular HTML-parsing library for Python, widely used for web scraping. As for JavaScript, I would recommend something like Cheerio (if you're asking for a scraping library written in JavaScript).

If you mean that the website loads its content through HTTP requests, you could always try reproducing those requests manually with something like the requests library.
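
A minimal sketch of that idea, assuming the browser's developer tools (Network tab) reveal an endpoint that returns the listing data. The endpoint `https://example.com/api/search`, the use of a JSON POST, and the helper `build_page_request` are all placeholders, not this site's real API; the payload shape is borrowed from the question's code. The standard library's `urllib` is used here instead of `requests` so the sketch has no dependencies, and the request is only built, not sent.

```python
import json
from urllib.request import Request

# Placeholder endpoint: substitute whatever the Network tab actually shows.
ENDPOINT = "https://example.com/api/search"


def build_page_request(page):
    """Build (without sending) a JSON POST request for one results page."""
    payload = {
        "data": {"countryId": "ES", "regionId": "920",
                 "duration": 7, "minPersons": 1},
        "config": {"page": str(page)},
    }
    body = json.dumps(payload).encode("utf-8")
    return Request(ENDPOINT, data=body,
                   headers={"Content-Type": "application/json"},
                   method="POST")


req = build_page_request(2)
print(req.get_method(), req.full_url)
print(json.loads(req.data)["config"]["page"])
```

Passing the result to `urllib.request.urlopen` would send it; with the requests library, the rough equivalent is `requests.post(ENDPOINT, json=payload)`.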

Hope this helps

JC1

Thanks for the answer, mate, but I've made some edits to my post; please take another look if you want @JC1 – wj127 Oct 09 '17 at 14:56

score 1 · Answer 3 · answered Oct 06 '17 at 12:26

You can definitely use Selenium on its own to scrape webpages with dynamic content (such as content loaded via AJAX).

Selenium relies on a WebDriver (which drives a real web browser) to fetch content from the web.

Here are a few of the most commonly used ones:

ChromeDriver
PhantomJS (my favorite)
Firefox

Once you're set up, you can start your bot and parse the HTML content of the webpage.

I've included a minimal working example below, using Python and ChromeDriver:

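from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 style; on Selenium 3, pass executable_path='chromedriver' instead
driver = webdriver.Chrome(service=Service('chromedriver'))
try:
    driver.get('https://www.google.com')
    # Then you can search for any element you want on the webpage
    # ('tsf-p' is the class name from the original example; it may differ today)
    search_bar = driver.find_element(By.CLASS_NAME, 'tsf-p')
    search_bar.click()
finally:
    # quit() ends the whole browser session, not just the current window
    driver.quit()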

See the documentation for more details!
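
For the parsing step, an HTML parser is usually sturdier than regular expressions. Here is a minimal sketch using only the standard library, assuming (as the regexes in the question suggest) that each listing is an element whose class contains `listclickable` and whose `data-link` attribute holds the item URL. The class name `DataLinkCollector` and the sample HTML are made up for illustration; with Selenium, `driver.page_source` would be fed in instead of the sample string.

```python
from html.parser import HTMLParser


class DataLinkCollector(HTMLParser):
    """Collect data-link values from elements with class 'listclickable'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "listclickable" in classes and attrs.get("data-link"):
            self.links.append(attrs["data-link"])


# Made-up sample standing in for driver.page_source.
sample = """
<div class="item listclickable" data-link="/item/123"></div>
<div class="item">no link here</div>
<div class="listclickable" data-link="/item/456"></div>
"""

collector = DataLinkCollector()
collector.feed(sample)
print(collector.links)  # ['/item/123', '/item/456']
```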

I knew more or less what you have written. I've made some edits to my post, please, check it out again @rak007 — wj127, Oct 09 '17 at 14:57
