
I am trying to crawl a few pages from monsterindia.com, but whenever I write an XPath in the Scrapy shell it gives me an empty result. However, there should be some way, because the view(response) command shows me the same HTML page.

I ran this command:

scrapy shell "https://www.monsterindia.com/search/computer-jobs"

on my terminal and then tried several different XPaths, such as response.xpath('//*[@class="job-tittle"]/text()').extract(), but no luck: I always got an empty result.

On the terminal:

scrapy shell "https://www.monsterindia.com/search/computer-jobs"

Then response.xpath('//div[@class="job-tittle"]/text()').extract() returned an empty result.

Then response.xpath('//*[@class="card-apply-content"]/text()').extract() also returned an empty result.

I expect it to give some results, i.e. the text from the website after crawling. Please help me with it.

  • This is because of JavaScript rendering. The response is "kind of empty", meaning that it doesn't really have all the information you see when loading the same page in a browser, but it contains all the JavaScript needed to render those results. You'll need to check what that JavaScript code is doing in order to understand it and get the results you need (a quick check is sketched just below these comments). – eLRuLL Apr 12 '19 at 16:51
  • eLRuLL, thanks for the reply. Could you please tell me a way or show a path to do that? – Pratyush Behera Apr 12 '19 at 21:41
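
A quick way to confirm what the comment above describes, from the same scrapy shell session, is to check whether the class names you see in the browser's element inspector are present in the raw HTML at all (a minimal sketch, using the classes from the question):

# Inside: scrapy shell "https://www.monsterindia.com/search/computer-jobs"
# If the job cards are rendered by JavaScript, the raw response won't contain them:
'job-tittle' in response.text          # likely False
'card-apply-content' in response.text  # likely False

# Note: view(response) writes the HTML to a temp file and opens it in the browser,
# where the JavaScript runs, so the page can look complete even though the raw
# response is not.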

2 Answers

The data you're looking for isn't in the page you requested, but in responses retrieved after the page loads. If you check "View Page Source" in your browser, you will see what actually came back in the first request.

And by inspecting the network tab in dev tools, you will see the further requests, like the one to this URL: https://www.monsterindia.com/middleware/jobsearch?query=computer&sort=1&limit=25
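
If you want to stay inside Scrapy, one option is to point the spider straight at that middleware URL and parse the JSON it returns. The following is only a minimal sketch: the spider name and the title/companyName fields in the yielded item are assumptions for illustration, so check the real payload (the other answer links an example) for the actual field names.

import json

import scrapy


class ComputerJobsSpider(scrapy.Spider):
    name = 'computer_jobs'  # hypothetical name, for illustration only
    start_urls = [
        'https://www.monsterindia.com/middleware/jobsearch?query=computer&sort=1&limit=25'
    ]

    def parse(self, response):
        # The endpoint returns JSON, not HTML, so there is nothing to XPath;
        # decode the body and walk the structure instead.
        payload = json.loads(response.text)
        for item in payload.get('jobSearchResponse', {}).get('data', []):
            # Field names below are assumptions; adjust them to the real payload.
            yield {
                'title': item.get('title'),
                'company': item.get('companyName'),
            }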

– Thiago Curvelo
  • Also check the answers to https://stackoverflow.com/q/8550114/939364, they may prove helpful as well. – Gallaecio Apr 13 '19 at 11:10

So what I think Thiago was getting at is that the page updates via XHR requests, which include a results-count query string parameter. Those requests return JSON you can parse, so you change your URL to that endpoint and handle the JSON accordingly.

Using requests to demonstrate

import requests
from bs4 import BeautifulSoup as bs
import json

# Hit the XHR endpoint directly; limit=100 asks for up to 100 results in one go.
r = requests.get('https://www.monsterindia.com/middleware/jobsearch?query=computer&sort=1&limit=100')

# The endpoint returns JSON, so the lxml parser wraps the whole body in a single
# <p> tag; pull that text back out and parse it. (r.json() would be a more
# direct way to get the same dict.)
soup = bs(r.content, 'lxml')
data = json.loads(soup.select_one('p').text)['jobSearchResponse']['data']

for item in data:
    print(item)

JSON of first item

https://jsoneditoronline.org/?id=fe49c53efe10423a8d49f9b5bdf4eb36


With scrapy:

jsonres = json.loads(response.body_as_unicode())
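
Depending on your Scrapy version, response.body_as_unicode() may be deprecated in favour of response.text, and Scrapy 2.2+ also offers response.json(), which parses the body for you; either way, the keys are the same as in the requests example above:

jsonres = json.loads(response.text)   # avoids the deprecated body_as_unicode()
# or, on Scrapy 2.2+:
jsonres = response.json()

data = jsonres['jobSearchResponse']['data']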
– QHarr