-1

I'm studying scrapy and am trying to crawl through this website - http://bananarepublic.gap.com/browse/category.do?cid=1055063&sop=true

However, my Scrapy code cannot find the product links listed on this page. Could anyone tell me why? The XPath I'm using is //a[@class="product-card--link"]/@href

Is this because of JavaScript? If so, I tried using Scrapy Splash but still cannot find the product links. Could someone please help?

Thank you!

user6055239

2 Answers

1

The items are generated via AJAX requests. When you load the page, a JavaScript script runs and makes extra HTTP requests to retrieve JSON data. Scrapy does not execute JavaScript, so you need to find those AJAX requests yourself and call them directly.

See the related question "Can scrapy be used to scrape dynamic content from websites that are using AJAX?" for how to inspect network traffic and solve such cases.

In this particular case, the first XHR request the page makes returns a huge JSON file with all of the item data:

http://bananarepublic.gap.com/resources/productSearch/v1/search?cid=1055063&isFacetsEnabled=true&globalShippingCountryCode=&globalShippingCurrencyCode=&locale=en_US&

As you can see, the URL takes several arguments. The most important is cid, which stands for category id; the others are mostly for calculating shipping prices, so if you don't care about those, this works just as well:

http://bananarepublic.gap.com/resources/productSearch/v1/search?cid=1055063
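To illustrate calling that endpoint directly, here is a minimal sketch that builds the URL for a category id and pulls values out of the returned JSON. The layout of the JSON is not shown above, so the key name `productId` is a hypothetical placeholder, and the helper names `build_search_url`, `find_values` and `parse_products` are made up for this example. Check the real response in your browser's network tab and adjust the key.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the XHR request shown above.
SEARCH_ENDPOINT = "http://bananarepublic.gap.com/resources/productSearch/v1/search"


def build_search_url(cid):
    """Return the search URL for one category id (the only required argument)."""
    return SEARCH_ENDPOINT + "?" + urlencode({"cid": cid})


def find_values(node, key):
    """Yield every value stored under `key` anywhere in a parsed JSON tree."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending, since products may be nested at any depth.
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


def parse_products(body, key="productId"):
    """Parse a JSON response body and collect the values found under `key`.

    `productId` is an assumed key name, not confirmed by the real response.
    """
    return list(find_values(json.loads(body), key))
```

In a Scrapy spider you could yield `scrapy.Request(build_search_url("1055063"), callback=self.parse_json)` and call `parse_products(response.text)` inside that callback. The recursive search is deliberately loose so it keeps working without knowing the exact nesting; once you know the structure, plain dictionary indexing is clearer.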

Community
Granitosaurus
1

An alternative that avoids digging deep into the AJAX requests would be using Splash (https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash/) to scrape the page after the AJAX has been processed.

This can be a bit easier to implement, and your XPath expression should work fine with Splash. However, the scraper will be slower because it has to render each page.
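As a rough sketch of the rendering approach, the code below builds a request URL for Splash's `render.html` HTTP endpoint and extracts the product links from the HTML it returns. It assumes a Splash instance running locally on the default port 8050, and the `wait` value is a guess that you may need to raise so the AJAX calls have time to finish. The helper names are invented for this example, and it uses only the standard library rather than the scrapy-splash integration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Assumption: Splash is running locally on its default port.
SPLASH_URL = "http://localhost:8050"


def splash_render_url(page_url, wait=2.0):
    """Build a Splash render.html URL that returns the page after JS has run."""
    return SPLASH_URL + "/render.html?" + urlencode({"url": page_url, "wait": wait})


class ProductLinkParser(HTMLParser):
    """Collect href values from <a> tags carrying the product-card--link class."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        # Match the class as one token among several; this is slightly looser
        # than the exact-match XPath //a[@class="product-card--link"].
        classes = (attrs.get("class") or "").split()
        if "product-card--link" in classes and attrs.get("href"):
            self.links.append(attrs["href"])


def extract_product_links(html):
    """Return all product link hrefs found in rendered HTML."""
    parser = ProductLinkParser()
    parser.feed(html)
    return parser.links
```

Fetching `splash_render_url(...)` with any HTTP client should give you the rendered markup to pass to `extract_product_links`. Inside Scrapy, the same XPath from the question can be applied to the rendered response instead of using this parser.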

Done Data Solutions