1

I know how to write XPaths for HTML data points with Scrapy. But I need to scrape all the event URLs (the starting URLs) from this page, where they are embedded in JSON format:

https://highape.com/bangalore/all-events

view-source:https://highape.com/bangalore/all-events

I usually write this in this format:

def parse(self, response):
    events = response.xpath('**What To Write Here?**').extract()

    for event in events:
        absolute_url = response.urljoin(event)
        yield Request(absolute_url, callback=self.parse_event)

Please tell me what I should write in the 'What To Write Here?' portion.

Debbie

2 Answers

2

View the page source of the URL, copy lines 76-9045, and save them as data.json on your local drive, then use this code...

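import json

import requests
from bs4 import BeautifulSoup

# Fetch the listing page live, so no local copy needs to be maintained
req = requests.get('https://highape.com/bangalore/all-events')
soup = BeautifulSoup(req.content, 'html.parser')
# Index 5 is the <script> tag assumed to hold the events JSON
js = soup.find_all('script')[5].text
# strict=False allows control characters such as raw newlines inside strings
data = json.loads(js, strict=False)
for i in data:
    url = i['url']
    print(url)
    # callback with Scrapy goes here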
Sohan Das
  • Hi, your solution worked. But as you can see, the URL was for Bangalore only: https://highape.com/bangalore/all-events. For Bangalore alone I am maintaining a big file on my machine. New events keep being added and old events removed, so I have to update that file every day. For all cities it is practically impossible to maintain big files locally, so your solution is impractical. Could you please suggest something else? – Debbie Oct 12 '18 at 16:11
  • Answer updated! If you like it, please upvote and accept! – Sohan Das Oct 12 '18 at 16:51
  • Sure. I just need some time to check if the answer works for me. – Debbie Oct 12 '18 at 16:53
0

What to write here?

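# /text() drops the <script> tags; extract_first() returns one string, not a list
events = response.xpath("//script[@type='application/ld+json']/text()").extract_first()
# strict=False tolerates the literal \r and \n characters in the data
events = json.loads(events, strict=False)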
nosklo
  • response.xpath("//script[@type='application/ld+json']").extract() - fetches line 75 to line 9046 on view source page. This line: events = json.loads(events) gives this error: TypeError: expected string or buffer. If I modify the second line and write: for event in events: event1 = json.loads(event) I get this error: ValueError: No JSON object could be decoded – Debbie Oct 12 '18 at 18:23
  • @Debbie looks like we have to call `str()` on it? Edited my answer – nosklo Oct 13 '18 at 02:08
  • `events` is a list b/c `extract()` returns a list. In this case there are two elements retrieved with that xpath. If you call `events = response.xpath("//script[@type='application/ld+json']/text()").extract_first()` you'll get the desired data. You'll still get a JSONDecodeError because there are literal `\r` and `\n` chars in the data (first example is right after "All these games are developed by highly skilled developers who ensure that the"). See [here](https://stackoverflow.com/questions/9295439/python-json-loads-fails-with-valueerror-invalid-control-character-at-line-1-c) for help – pwinz Oct 13 '18 at 18:02
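
The decoding problem described in the last comment can be reproduced without Scrapy. The sketch below is only an illustration: the `sample` string and the `extract_event_urls` helper are hypothetical, and it assumes the script body is a JSON list of objects that each carry a `url` key, as the loop in the first answer suggests. It shows that the default strict parser rejects literal `\r` and `\n` characters inside a JSON string, while `strict=False` accepts them.

```python
import json


def extract_event_urls(raw_json):
    """Return the 'url' value of every event in a JSON list.

    strict=False lets the decoder accept raw control characters
    (such as literal carriage returns and newlines) inside strings.
    """
    events = json.loads(raw_json, strict=False)
    return [event["url"] for event in events if "url" in event]


# Hypothetical stand-in for the script body; the description holds a
# literal CR/LF pair, like the real page data is reported to.
sample = (
    '[{"name": "Game night", '
    '"description": "line one\r\nline two", '
    '"url": "/bangalore/events/sample-event"}]'
)

try:
    json.loads(sample)  # default strict mode
except ValueError as exc:  # JSONDecodeError is a subclass of ValueError
    print("strict parse failed:", exc)

for url in extract_event_urls(sample):
    print(url)
```

In a spider, the string passed to `extract_event_urls` would presumably be the result of `response.xpath("//script[@type='application/ld+json']/text()").extract_first()`, and each returned URL could go through `response.urljoin` before being yielded as a `Request`.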