
I'm interested in extracting just the XHR URLs, not every URL in a webpage (screenshot reference).

This is my code to extract every URL in a page:

import scrapy

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class TestSpider(CrawlSpider):
    name = 'test'
    start_urls = ['SomeURL']
    filename = 'test.txt'

    rules = (
        # follow every link on the page and pass each response to parse_item
        Rule(LinkExtractor(allow=('',)), callback='parse_item'),
    )

    def parse_item(self, response):
        # append the URL of each crawled page to the output file
        with open(self.filename, 'a') as f:
            f.write(response.url + '\n')

Thanks,

Edited: Hi, thanks for the comments. After more research I came across this: Scraping ajax pages using python. What I want is to do what that answer describes, but automatically. I need to do this for a large number of web pages, and manually inserting the URLs isn't an option. Is there a way to do that? Listen to the XHR requests of a site and save the URLs?

  • Please mention the error/doubt. Your question is unclear. – Rahul Mar 04 '16 at 16:22
  • I'm not familiar with XHRs (though I just read up on them), can you tell the difference between an XHR and a 'normal' URL? Maybe the 'api.'? – Steve Mar 04 '16 at 16:57
  • Have you considered using http://www.crummy.com/software/BeautifulSoup/bs4/doc/ ? this is a great way to parse html/xml docs... – Kepedizer Mar 05 '16 at 21:46
  • XmlHttpRequests ("AJAX") URLs are dynamically created and fetched by JavaScript. You will not be able to find those URLs in the scraped page (try viewing the source of the page: do you see the URLs in the html markup? if that's not the case, scrapy cannot see them either). What this means is that you'll need another way of scraping. Either reverse engineering the sequence of XHR URLs fetched by the page, or using a "headless" browser like [Selenium](http://www.seleniumhq.org/) or [PhantomJS](http://phantomjs.org/). – Greg Sadetsky Mar 05 '16 at 21:59
  • @GregSadetsky I agree that a headless browser is the way ahead. We will have to force all requests to go through a proxy and then filter them out based on their request type, correct? – fixxxer Mar 23 '17 at 10:28
  • I'm not sure that I follow why you would need to use a proxy (this overcomplicates things). The simplest solution is probably figuring out the pattern of AJAX URLs and scraping those. – Greg Sadetsky Mar 24 '17 at 01:25
  • I'm also interested to extract just the XHR's url from a webpage. I can't figure out the pattern of AJAX URLs as it is too complicated. – Fxs7576 Apr 12 '17 at 18:02
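
Building on the headless-browser suggestion in the comments above, here is a minimal sketch of how the XHR URLs of a page could be collected automatically. It assumes Selenium 4+ with Chrome: Chrome's "performance" log exposes DevTools network events, which can be filtered by resource type. The target URL and output file name are placeholders.

import json
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
# ask Chrome to record DevTools network events
options.set_capability('goog:loggingPrefs', {'performance': 'ALL'})

driver = webdriver.Chrome(options=options)
driver.get('https://example.com')  # placeholder URL

xhr_urls = []
for entry in driver.get_log('performance'):
    message = json.loads(entry['message'])['message']
    if message['method'] == 'Network.requestWillBeSent':
        params = message['params']
        # 'XHR' covers XMLHttpRequest; 'Fetch' covers fetch() calls
        if params.get('type') in ('XHR', 'Fetch'):
            xhr_urls.append(params['request']['url'])
driver.quit()

with open('xhr_urls.txt', 'a') as f:  # placeholder output file
    f.write('\n'.join(xhr_urls) + '\n')

Note that XHRs fired after the initial page load (e.g. on scroll or click) will only appear if they happen before the log is read, so a short wait or an explicit interaction may be needed.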

1 Answer


There is no reliable, single way to get "the AJAX URL" of a web page. Web pages can use any number of AJAX URLs (and most of them would not be the ones you are looking for), and they can be triggered in very different ways.

Also, the URLs themselves are seldom useful on their own: each one can return any kind of data, and it is usually the data that you are interested in.

You should find AJAX URLs manually on a website-by-website basis.
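
Once you have identified an AJAX endpoint in the browser's network tab, you can usually bypass the HTML page entirely and request the endpoint directly with Scrapy. A minimal sketch, with a made-up endpoint and made-up JSON field names:

import json
import scrapy

class ApiSpider(scrapy.Spider):
    name = 'api'
    # hypothetical endpoint: replace with the XHR URL found in the
    # browser's network tab for the site you are scraping
    start_urls = ['https://example.com/api/items?page=1']

    def parse(self, response):
        data = json.loads(response.text)
        # 'items' and 'url' are made-up field names for illustration
        for item in data.get('items', []):
            yield {'url': item.get('url')}

Many such endpoints are paginated, so incrementing a page parameter in follow-up requests will often cover the whole dataset.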

Gallaecio