
I am practising with the Scrapy shell from the command prompt, using this URL: https://shopee.com.my/shop/145423/followers/?__classic__=1

In Google Chrome's developer tools (F12), I cleared the Network tab, scrolled down the page, and got this link: https://shopee.com.my/shop/145423/followers/?offset=60&limit=20&offset_of_offset=0&_=1610787400133 The link is supposed to return some data, but when I try

scrapy shell https://shopee.com.my/shop/145423/followers/?offset=60&limit=20&offset_of_offset=0&_=1610787400133

I get a 404 response. I think a pop-up that asks the user to choose a language is causing the problem. How can such a pop-up be dealt with or skipped?

YasserKhalil
  • Have you tried wrapping the URL in quotes, like `scrapy shell "the_url"`? Maybe the shell is interpreting `&limit=20` etc as setting environment variables? – ForceBru Jan 16 '21 at 09:49
  • I tried your suggestion but got the same problem. I exported `response.text` and noticed the text `Select Your Language` in it, so I think this is related to the pop-up. – YasserKhalil Jan 16 '21 at 10:23
  • It shouldn't be related to the pop-up. 404 means "there's no such page", while the pop-up comes _from the page you successfully loaded_, so it's an entirely different issue now. – ForceBru Jan 16 '21 at 10:46

1 Answer


Send a User-Agent header with the request. You can do this from the command line as well, inside the Scrapy shell:

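>>> url = "https://shopee.com.my/shop/145423/followers/?offset=60&limit=20&offset_of_offset=0&_=1610787400133"
>>> headers = {'User-Agent': 'Mybot'}
>>> r = scrapy.Request(url, headers=headers)
>>> fetch(r)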
2021-01-16 16:53:11 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://shopee.com.my/shop/145423/followers/?offset=60&limit=20&offset_of_offset=0&_=1610787400133&__classic__=1> from <GET https://shopee.com.my/shop/145423/followers/?offset=60&limit=20&offset_of_offset=0&_=1610787400133>
2021-01-16 16:53:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://shopee.com.my/shop/145423/followers/?offset=60&limit=20&offset_of_offset=0&_=1610787400133&__classic__=1> (referer: None)
>>> response.status
200
>>> 
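
As a rough illustration of the same idea outside the Scrapy shell, the sketch below attaches a User-Agent header using Python's standard-library `urllib.request` rather than Scrapy. The helper name `build_request` and the `Mybot` value are illustrative only. The request is merely constructed here, not sent, and whether the site accepts any particular User-Agent string is an assumption.

```python
from urllib.request import Request

# URL taken from the question above.
URL = ("https://shopee.com.my/shop/145423/followers/"
       "?offset=60&limit=20&offset_of_offset=0&_=1610787400133")


def build_request(url: str, user_agent: str = "Mybot") -> Request:
    """Return a GET request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


req = build_request(URL)
# urllib stores header names capitalized, so the key is "User-agent".
print(req.get_header("User-agent"))
```

In a Scrapy project, the `USER_AGENT` setting plays the same role for every request the crawler makes.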
Samsul Islam
  • Amazing. Can you explain what `Mybot` means as the User-Agent header value? – YasserKhalil Jan 16 '21 at 11:09
  • I tried the line `response.css('li.clickable_area::text').getall()`, but all I got was line breaks, although the data is there in `response.text`. – YasserKhalil Jan 16 '21 at 11:17
  • The User-Agent can be that of Firefox or Chrome; for details see https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent . Try `response.css('li.clickable_area a::text').getall()`. Hope it helps. – Samsul Islam Jan 16 '21 at 11:25
  • That's great. The line `response.css('li.clickable_area a::text').getall()` works, but I get a lot of empty results and each name appears twice. How can I get just the names from that URL? – YasserKhalil Jan 16 '21 at 11:31
  • As a list: `response.css('li.clickable_area a::attr("href")').getall()`. For a single link, use `get()` instead. – Samsul Islam Jan 16 '21 at 11:33
  • That's wonderful, but I got each name three times: `['/empirefitnesssolution', '/empirefitnesssolution', '/empirefitnesssolution', '/nahyeock', '/nahyeock', '/nahyeock',` – YasserKhalil Jan 16 '21 at 12:00
  • I think you need a more specific CSS selector, or to filter the results. – Samsul Islam Jan 16 '21 at 12:06
  • `for i in response.css('li.clickable_area'): print(i.css('a::attr("href")').get())` – Samsul Islam Jan 16 '21 at 12:15
  • Amazing. Thank you very much. One last point: why do I get a slash at the start of each string? – YasserKhalil Jan 16 '21 at 12:18
  • Because it is part of the URL (a relative path). You can join it with the main URL using `response.urljoin()`: `for i in response.css('li.clickable_area'): part = i.css('a::attr("href")').get(); response.urljoin(part)` – Samsul Islam Jan 16 '21 at 12:32
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/227437/discussion-between-samsul-islam-and-yasserkhalil). – Samsul Islam Jan 17 '21 at 06:37
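
The last few comments boil down to two steps: drop the repeated hrefs, then turn each relative path into a full URL. Below is a minimal sketch using only the standard library, fed with the sample values quoted in the comments above. `BASE_URL` and the helper name `absolute_unique` are assumptions for illustration. Inside the Scrapy shell, `response.urljoin()` would resolve against the response's own URL instead.

```python
from urllib.parse import urljoin

# Assumed base: the page URL from the question.
BASE_URL = "https://shopee.com.my/shop/145423/followers/?__classic__=1"

# Sample hrefs as reported in the comments (each one repeated three times).
hrefs = [
    "/empirefitnesssolution", "/empirefitnesssolution", "/empirefitnesssolution",
    "/nahyeock", "/nahyeock", "/nahyeock",
]


def absolute_unique(base: str, paths: list[str]) -> list[str]:
    """Deduplicate paths (keeping first-seen order) and make them absolute."""
    # dict.fromkeys keeps insertion order, so duplicates are dropped in place.
    unique = list(dict.fromkeys(p for p in paths if p))
    return [urljoin(base, p) for p in unique]


links = absolute_unique(BASE_URL, hrefs)
print(links)
```

Because each href starts with a slash, `urljoin` replaces the whole path of the base URL and keeps only its scheme and host.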