
I'm trying to get all links on a page 'https://www.jumia.com.eg' using scrapy.

The code is like this:

all_categories = response.xpath('//a')

But I found a lot of missing links in the results.

The result contains only 242 links.

When I tried the same XPath selector (//a) in Chrome developer tools, I got all the links: 608 results.


Why doesn't Scrapy get all the links using this selector while Chrome does?
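To see what a selector actually matches, it helps to run the same kind of query against a known HTML snippet first. Below is a minimal stand-in using only the standard library; real pages are rarely well-formed XML, so this is just an illustration — inside a spider you would check len(response.xpath('//a')) on the downloaded response instead:

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed snippet standing in for the downloaded page source.
sample = '<html><body><a href="/a">A</a><div><a href="/b">B</a></div></body></html>'

root = ET.fromstring(sample)
# './/a' is ElementTree's equivalent of the XPath //a used above.
links = root.findall('.//a')
print(len(links))  # → 2
```

Counting on the raw response this way (rather than in the browser) shows exactly which links were present in the HTML that Scrapy downloaded, before any JavaScript runs.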

Ehab
  • The other links are loaded via JavaScript. –  Jul 20 '20 at 19:04
  • I'm now aware of this problem. Is there a solution to it? – Ehab Jul 22 '20 at 13:11
  • Try the suggestions in https://docs.scrapy.org/en/latest/topics/dynamic-content.html#selecting-dynamically-loaded-content –  Jul 22 '20 at 13:39

2 Answers


That's because the website is using reCAPTCHA.

If you run view(response) in the Scrapy shell, you will notice that you are actually parsing the reCAPTCHA page (which explains the unexpected a tag count).


You can try solving the reCAPTCHA (not sure how easy that would be, but this question might help). Alternatively, you can run your scraper through a proxy such as Crawlera, which uses rotating IPs. I have not used Crawlera, but according to their website it retries the page several times (with different IPs) until it gets a clean response.
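A cheap way to catch this situation in a spider is to scan the raw response for reCAPTCHA markers before parsing. The marker strings below are assumptions (common reCAPTCHA fingerprints), not an exhaustive or official list:

```python
def looks_like_captcha(html: str) -> bool:
    """Heuristic: True if the raw HTML appears to be a reCAPTCHA interstitial.

    The marker strings are illustrative fingerprints, not an official API.
    """
    markers = ("g-recaptcha", "recaptcha/api.js", "grecaptcha")
    lowered = html.lower()
    return any(marker in lowered for marker in markers)

# In a Scrapy callback you might bail out (or retry through a proxy) when
# looks_like_captcha(response.text) is True.
print(looks_like_captcha('<div class="g-recaptcha" data-sitekey="x"></div>'))  # → True
print(looks_like_captcha('<a href="/phones">Phones</a>'))                      # → False
```

Checking this before counting links makes it obvious when the low link count comes from an interstitial page rather than from the selector.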

Hooman Bahreini

It turned out that the problem is that the data is loaded using JavaScript, as Justin commented.
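The gap between the two counts can be reproduced offline: a static parser sees only the server-rendered anchors, while the browser's DOM also includes the ones scripts insert after the page loads. A sketch using only the standard library (the HTML snippet is made up for illustration):

```python
from html.parser import HTMLParser

class AnchorCounter(HTMLParser):
    """Counts <a> start tags in the HTML it is fed; no JavaScript is executed."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.count += 1

# Two server-rendered links, plus a script that would add a third in a browser.
page = """
<html><body>
  <a href="/phones">Phones</a>
  <a href="/tvs">TVs</a>
  <script>
    var a = document.createElement('a');
    a.href = '/laptops';
    document.body.appendChild(a);
  </script>
</body></html>
"""

counter = AnchorCounter()
counter.feed(page)
print(counter.count)  # → 2; the script-added link never exists in the raw HTML
```

This is exactly what happens with Scrapy: it parses the raw HTML and never runs the script, so the extra links only exist in the browser's live DOM.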

Ehab