
I'm trying to get all links on a page 'https://www.jumia.com.eg' using scrapy.

The code is like this:

all_categories = response.xpath('//a')

But I found a lot of missing links in the results.

The result contains only 242 links.

When I tried the same XPath selector (//a) in Chrome developer tools, I got all the links: 608 results.


Why doesn't Scrapy get all the links using this selector while Chrome does?
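To see what a selector actually matches, it helps to run the same kind of query against a known HTML snippet first. Below is a minimal stand-in using only the standard library; real pages are rarely well-formed XML, so this is just an illustration — inside a spider you would check len(response.xpath('//a')) on the downloaded response instead:

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed snippet standing in for the downloaded page source.
sample = '<html><body><a href="/a">A</a><div><a href="/b">B</a></div></body></html>'

root = ET.fromstring(sample)
# './/a' is ElementTree's equivalent of the XPath //a used above.
links = root.findall('.//a')
print(len(links))  # → 2
```

Counting on the raw response this way (rather than in the browser) shows exactly which links were present in the HTML that Scrapy downloaded, before any JavaScript runs.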

Ehab
  • The other links are loaded via JavaScript. –  Jul 20 '20 at 19:04
  • I'm now aware of this problem. Is there a solution to it? – Ehab Jul 22 '20 at 13:11
  • Try the suggestions in https://docs.scrapy.org/en/latest/topics/dynamic-content.html#selecting-dynamically-loaded-content –  Jul 22 '20 at 13:39

2 Answers


That's because the website is using reCAPTCHA.

If you run view(response) in the Scrapy shell, you will notice that you are actually parsing the reCAPTCHA page (which explains the unexpected a tag count).


You can try solving the reCAPTCHA (not sure how easy that would be, but this question might help). Alternatively, you can run your scraper through a proxy such as Crawlera, which uses rotating IPs. I have not used Crawlera, but according to their website it retries the page several times (with different IPs) until it gets a clean response.
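A cheap way to catch this situation in a spider is to scan the raw response for reCAPTCHA markers before parsing. The marker strings below are assumptions (common reCAPTCHA fingerprints), not an exhaustive or official list:

```python
def looks_like_captcha(html: str) -> bool:
    """Heuristic: True if the raw HTML appears to be a reCAPTCHA interstitial.

    The marker strings are illustrative fingerprints, not an official API.
    """
    markers = ("g-recaptcha", "recaptcha/api.js", "grecaptcha")
    lowered = html.lower()
    return any(marker in lowered for marker in markers)

# In a Scrapy callback you might bail out (or retry through a proxy) when
# looks_like_captcha(response.text) is True.
print(looks_like_captcha('<div class="g-recaptcha" data-sitekey="x"></div>'))  # → True
print(looks_like_captcha('<a href="/phones">Phones</a>'))                      # → False
```

Checking this before counting links makes it obvious when the low link count comes from an interstitial page rather than from the selector.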

Hooman Bahreini

It turned out that the problem is that the data is loaded using JavaScript, as Justin commented.
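The gap between the two counts can be reproduced offline: a static parser sees only the server-rendered anchors, while the browser's DOM also includes the ones scripts insert after the page loads. A sketch using only the standard library (the HTML snippet is made up for illustration):

```python
from html.parser import HTMLParser

class AnchorCounter(HTMLParser):
    """Counts <a> start tags in the HTML it is fed; no JavaScript is executed."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.count += 1

# Two server-rendered links, plus a script that would add a third in a browser.
page = """
<html><body>
  <a href="/phones">Phones</a>
  <a href="/tvs">TVs</a>
  <script>
    var a = document.createElement('a');
    a.href = '/laptops';
    document.body.appendChild(a);
  </script>
</body></html>
"""

counter = AnchorCounter()
counter.feed(page)
print(counter.count)  # → 2; the script-added link never exists in the raw HTML
```

This is exactly what happens with Scrapy: it parses the raw HTML and never runs the script, so the extra links only exist in the browser's live DOM.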

Ehab