I'm using Scrapy and testing my selector in the Scrapy shell, but I can't get it to work. I'm trying to scrape the JSON data from this page:

https://web.archive.org/web/20180604230058/https://api.simon.com/v1.2/tenant?mallId=231&key=40A6F8C3-3678-410D-86A5-BAEE2804C8F2&lw=true

I've tried to scrape the data using the selector

    response.css("body > pre::text").extract()

However, this doesn't seem to be working. Not sure what's wrong...

Ideally, I just want to get all the "Name: XXX" elements from the JSON data, so if you know how to select those specifically, that would be very helpful as well!

Currently, my code looks like this:

    # -*- coding: utf-8 -*-
    import scrapy # needed to scrape
    import sys    # needed to extend the import path for xlrd
    # append (not extend) adds the path as one entry; extend would add each character
    sys.path.append("/Users/YoungFreeesh/anaconda3/lib/python3.6/site-packages/")
    import xlrd   # used to easily import xlsx file

    class AmazonbotSpider(scrapy.Spider):
        name = 'ArchiveSpider'

        allowed_domains = ['web.archive.org']
        start_urls =['https://web.archive.org/web/20180604230058/https://api.simon.com/v1.2/tenant?mallId=231&key=40A6F8C3-3678-410D-86A5-BAEE2804C8F2&lw=true']

        def parse(self, response):
            print(response.body)
WhiteDillPickle
  • Re: "this doesn't seem to be working" — not sure anyone is a mind reader here. I could be wrong though... – l'L'l Jun 11 '18 at 20:16
  • I checked the networks log and it loads the json file from this url https://web.archive.org/web/20180604230058if_/https://api.simon.com/v1.2/tenant?mallId=231&key=40A6F8C3-3678-410D-86A5-BAEE2804C8F2&lw=true .. Difference between both urls is 'if_'. See if this pattern matches with other links you have. You can use this hack to get your data. – sP_ Jun 11 '18 at 20:19
  • @SP_ Thanks! That works. – WhiteDillPickle Jun 11 '18 at 20:53
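
The 'if_' pattern from the comment above can be applied to other archived links with a small helper. This is only a sketch: the function name to_raw_wayback_url is made up for illustration, and it assumes the archive URL has the form /web/<digits>/<original URL>, as in the link from the question.

```python
import re

def to_raw_wayback_url(url):
    """Insert 'if_' after the numeric timestamp of a Wayback Machine URL.

    Assumes the URL looks like https://web.archive.org/web/<digits>/<original>.
    URLs that already carry 'if_' (or do not match) are returned unchanged.
    """
    return re.sub(r"(/web/\d+)/", r"\1if_/", url, count=1)

page_url = (
    "https://web.archive.org/web/20180604230058/"
    "https://api.simon.com/v1.2/tenant?mallId=231"
    "&key=40A6F8C3-3678-410D-86A5-BAEE2804C8F2&lw=true"
)
raw_url = to_raw_wayback_url(page_url)
print(raw_url)
```

The resulting URL can then be used in start_urls in place of the wrapped page, so the response body is the JSON itself rather than the archive's HTML frame.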

1 Answer

The content is inside an iframe, which is a separate page, so you have to request the iframe's URL first. Treat it like a link and follow it, something like this:

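    urls = response.css('iframe::attr(src)').extract()
    for url in urls:
        # Request takes callback= (there is no target= argument);
        # urljoin resolves the src in case it is relative
        yield scrapy.Request(response.urljoin(url), callback=self.parse_iframe)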

Then define a new parse_iframe method where you parse the iframe's response.
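
For the "Name" values the question asks about, the parsing step could look roughly like the sketch below. It assumes the iframe's response body is plain JSON text and that the entries use a "Name" key, as the question suggests; the helper names and the sample data are hypothetical, since the real structure of the API response is not shown here.

```python
import json

def collect_values(node, key="Name"):
    """Recursively gather every value stored under `key` in parsed JSON."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # Keep descending: nested objects may hold further matches
            found.extend(collect_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_values(item, key))
    return found

def names_from_json(text, key="Name"):
    """Parse a JSON string and return all values found under `key`."""
    return collect_values(json.loads(text), key)

# Hypothetical sample; the real payload will have different fields
sample = (
    '[{"Id": 1, "Name": "Store A"},'
    ' {"Id": 2, "Name": "Store B", "Tags": [{"Name": "Shoes"}]}]'
)
print(names_from_json(sample))
```

Inside the spider, parse_iframe could then call names_from_json(response.text) and yield or log the result. Walking the whole structure avoids guessing where the "Name" keys sit, at the cost of also picking up nested ones you may not want.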

nosklo
  • Here is a similar question: https://stackoverflow.com/questions/52779161/python-scrapy-json-xpath-how-to-scrape-json-data-with-scrapy/52779299#52779299 Could you please answer? – Debbie Oct 12 '18 at 13:06