
Why am I not getting the text? I've used this script on many websites and never came across this issue.

import scrapy.selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from Prijsvergelijking_Final.items import PrijsvergelijkingFinalItem

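# Load the known vendor names, one per line; strip newlines, backslashes and hyphens.
with open("vendors.txt", "r") as f:
    vendors = [line.strip("\n\\-") for line in f]
# Set of vendor names for fast membership checks.
e = set(vendors)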

class ArtcrafttvSpider(CrawlSpider):
    name = "ARTCRAFTTV"
    allowed_domains = ["artencraft.be"]
    start_urls = ["https://www.artencraft.be/nl/beeld-en-geluid/televisie"]
    rules = (
        Rule(
            LinkExtractor(allow=(), restrict_xpaths=('//li[@class="next"]',)),
            callback="parse_start_url",
            follow=True,
        ),
    )
    def parse_start_url(self, response):
        products = response.xpath("//ul[@class='product-overview list']/li")
        for product in products:
            item = PrijsvergelijkingFinalItem()
            item["Product_a"] = product.xpath(".//a/span/h3/text()").extract_first().strip().replace("-","")
            item["Product_price"] = product.xpath(".//a/h4/text()").extract_first()
            for word in item['Product_a'].split(" "):
                if word in e:
                    item['item_vendor'] = word              
            yield item

Website code:

HTML

Results after script is run:

Results

Any suggestions how I can solve this?

Tony
Wouter

2 Answers


Short answer:

Your XPath for the price field is wrong.

Detailed:

Do not assume that the page structure your spider receives is the same as what is displayed on your screen. It is NOT always WYSIWYG.

For some reason, Inspect Element (Firefox) shows the price as a child of the //a/h4 tag. The inspector shows the page after the browser has processed it, which can differ from the raw HTML. If you analyze the page source that is actually downloaded, you will see that the price is present on the page, but it is not a child of the //a/h4 tag. It is a child of the //a tag, so //a/text() (or .//a/text() relative to each product) would give you the desired value.
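To make the difference concrete, here is a minimal sketch using only the standard library. The markup is a hypothetical, simplified stand-in for one product entry (not copied from the site), and `direct_text` is a helper invented for this illustration that roughly mimics what the XPath `text()` step selects.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for one product entry as it might appear
# in the downloaded source: the price text sits directly inside <a>,
# after an empty <h4>, rather than inside the <h4> itself.
RAW_PRODUCT = """
<li>
  <a href="/nl/example-tv">
    <span><h3>ExampleBrand - TV 40</h3></span>
    <h4></h4>
    399,00
  </a>
</li>
"""


def direct_text(element):
    """Return the non-empty text nodes that are direct children of element.

    This is roughly what the XPath text() step selects: the element's
    leading text plus the text that follows each child (its tail).
    """
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    return [p.strip() for p in parts if p.strip()]


product = ET.fromstring(RAW_PRODUCT)
link = product.find(".//a")

# What the spider in the question looks at: the text inside a/h4.
h4_text = link.find("h4").text   # None, because the <h4> is empty

# What this answer suggests: the text directly under <a>.
a_text = direct_text(link)       # ['399,00']

print("a/h4 text:", h4_text)
print("a text:", a_text)
```

In the spider itself, the equivalent change would be to select `.//a/text()` instead of `.//a/h4/text()`. Since that can return several whitespace-only text nodes, you would probably want to strip them and keep the first non-empty one.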

MrPandav

It appears that the prices are loaded in by JavaScript or something similar: when I pull down the page from Python, I get no prices anywhere.
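A quick way to check this yourself is to search the raw downloaded HTML for anything that looks like a price. The sketch below assumes prices are written like "399,00", and both HTML strings are made-up examples, not the site's actual markup.

```python
import re

# Assumed price format: up to five digits, a comma or dot, then two decimals.
PRICE_PATTERN = re.compile(r"\b\d{1,5}[.,]\d{2}\b")


def find_prices(html):
    """Return all price-like strings found in a raw HTML string."""
    return PRICE_PATTERN.findall(html)


# Hypothetical page where the price is part of the downloaded markup.
static_html = '<a href="/tv"><h4></h4> 399,00 </a>'

# Hypothetical page where the price would only be filled in by a script.
script_html = '<a href="/tv"><h4 class="price"></h4></a><script src="prices.js"></script>'

print(find_prices(static_html))
print(find_prices(script_html))
```

If the list is empty for the HTML your script actually receives, the prices are either added later by JavaScript or withheld from your client.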

There are two possible things going on here. First, the prices might be loaded with JavaScript. If that's the case, I recommend looking at this answer: https://stackoverflow.com/a/26440563/629110 and the library dryscrape.

Second, if the prices are being blocked because of your user agent, you can try to change your user agent to that of a real browser: https://stackoverflow.com/a/10606260/629110.

Try the user agent first (since it is easier).
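As a rough illustration of the user agent idea, here is a standard-library sketch that builds a request with a browser-like `User-Agent` header without sending it. The user agent string and the `build_request` helper are assumptions for this example.

```python
import urllib.request

# Example browser-like User-Agent string (an assumption; any current
# browser's string should serve the same purpose).
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=BROWSER_UA):
    """Build (but do not send) a request that identifies as a browser."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("https://www.artencraft.be/nl/beeld-en-geluid/televisie")

# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

In a Scrapy project, the same effect is normally achieved with the `USER_AGENT` setting in `settings.py` (or in the spider's `custom_settings`), so no change to the parsing code is needed.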

Laxsnor