
I am trying to crawl a few pages from monsterindia.com, but whenever I write an XPath in the Scrapy shell it gives me an empty result. However, there should be some way, because the view(response) command shows me the same HTML page.

I ran this command:

scrapy shell "https://www.monsterindia.com/search/computer-jobs"

on my terminal and then tried several different XPaths, such as response.xpath('//*[@class="job-tittle"]/text()').extract(), but no luck: I always got an empty result.

On the terminal:

scrapy shell "https://www.monsterindia.com/search/computer-jobs"

Then response.xpath('//div[@class="job-tittle"]/text()').extract() returned an empty result.

Then response.xpath('//*[@class="card-apply-content"]/text()').extract() also returned an empty result.

I expect it to give some results, i.e. the text from the website after crawling. Please help me with it.

  • This is because of JavaScript rendering. The response is "kind of empty", meaning that it doesn't really have all the information you see when loading the same page in a browser, but it contains all the JavaScript needed to render those results. You'll need to check what that JavaScript code is doing in order to understand it and get the results you need (a quick check is sketched just below these comments). – eLRuLL Apr 12 '19 at 16:51
  • eLRuLL, thanks for the reply. Could you please tell me a way or show a path to do that? – Pratyush Behera Apr 12 '19 at 21:41
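
A quick way to confirm what the comment above describes, from the same scrapy shell session, is to check whether the class names you see in the browser's element inspector are present in the raw HTML at all (a minimal sketch, using the classes from the question):

# Inside: scrapy shell "https://www.monsterindia.com/search/computer-jobs"
# If the job cards are rendered by JavaScript, the raw response won't contain them:
'job-tittle' in response.text          # likely False
'card-apply-content' in response.text  # likely False

# Note: view(response) writes the HTML to a temp file and opens it in the browser,
# where the JavaScript runs, so the page can look complete even though the raw
# response is not.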

2 Answers

The data you're looking for isn't in the page you requested, but in responses retrieved after the page loads. If you check "View Page Source" in your browser, you will see what actually came back in the first request.

And by inspecting the network tab in dev tools, you will see the further requests, like the one to this URL: https://www.monsterindia.com/middleware/jobsearch?query=computer&sort=1&limit=25
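
If you want to stay inside Scrapy, one option is to point the spider straight at that middleware URL and parse the JSON it returns. The following is only a minimal sketch: the spider name and the title/companyName fields in the yielded item are assumptions for illustration, so check the real payload (the other answer links an example) for the actual field names.

import json

import scrapy


class ComputerJobsSpider(scrapy.Spider):
    name = 'computer_jobs'  # hypothetical name, for illustration only
    start_urls = [
        'https://www.monsterindia.com/middleware/jobsearch?query=computer&sort=1&limit=25'
    ]

    def parse(self, response):
        # The endpoint returns JSON, not HTML, so there is nothing to XPath;
        # decode the body and walk the structure instead.
        payload = json.loads(response.text)
        for item in payload.get('jobSearchResponse', {}).get('data', []):
            # Field names below are assumptions; adjust them to the real payload.
            yield {
                'title': item.get('title'),
                'company': item.get('companyName'),
            }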

– Thiago Curvelo
  • Also check the answers to https://stackoverflow.com/q/8550114/939364, they may prove helpful as well. – Gallaecio Apr 13 '19 at 11:10

So what I think Thiago was getting at is that the page updates via XHR requests, which include a results-count query string parameter. Those requests return JSON you can parse, so you change your URL to that endpoint and handle the JSON accordingly.

Using requests to demonstrate

import requests
from bs4 import BeautifulSoup as bs
import json

# Hit the XHR endpoint directly; limit=100 asks for up to 100 results in one go.
r = requests.get('https://www.monsterindia.com/middleware/jobsearch?query=computer&sort=1&limit=100')

# The endpoint returns JSON, so the lxml parser wraps the whole body in a single
# <p> tag; pull that text back out and parse it. (r.json() would be a more
# direct way to get the same dict.)
soup = bs(r.content, 'lxml')
data = json.loads(soup.select_one('p').text)['jobSearchResponse']['data']

for item in data:
    print(item)

JSON of first item

https://jsoneditoronline.org/?id=fe49c53efe10423a8d49f9b5bdf4eb36


With scrapy:

jsonres = json.loads(response.body_as_unicode())
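
Depending on your Scrapy version, response.body_as_unicode() may be deprecated in favour of response.text, and Scrapy 2.2+ also offers response.json(), which parses the body for you; either way, the keys are the same as in the requests example above:

jsonres = json.loads(response.text)   # avoids the deprecated body_as_unicode()
# or, on Scrapy 2.2+:
jsonres = response.json()

data = jsonres['jobSearchResponse']['data']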
– QHarr