
I need some help removing the text inside links while extracting the overall text of each comment (review) on a product page.

scrapy shell "https://www.amazon.com/Bond-Touch-Bracelets-Long-Distance-Lovers/dp/B07VBP4R7F/ref=zg_bs_electronics_49?_encoding=UTF8&psc=1&refRID=1167P05RGXTPJ40CAX5B" #Terminal

my_str = ""
# iterate over each review block in the review list
for ind, review in enumerate(response.xpath('//*[@id="cm-cr-dp-review-list"]/div'), start=1):
    # collect every text node inside the review body
    parts = review.xpath('.//*[@data-hook="review-body"]//text()').extract()
    # drop whitespace-only fragments
    parts = [p.strip() for p in parts if p.strip()]
    my_str += f"{ind}) {' '.join(parts)}\n"
print(my_str)

My code works so far, but if there is a link in a comment I extract its text as well (which I do not want). How can I edit my XPath so that it skips the text inside <a> tags?

I checked some of the answers on "Removing markup links in text", but they use regex, which didn't work here.
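One common approach in XPath 1.0 is to filter the selected text nodes with a predicate that excludes anything inside an <a> element: text()[not(ancestor::a)]. Below is a minimal, self-contained sketch using lxml; the same predicate should work unchanged inside a Scrapy selector's .xpath() call. The HTML here is invented sample markup to illustrate the idea, not Amazon's actual page structure.

```python
from lxml import html

# Invented sample markup mimicking a review body that contains a link
sample = """
<div id="cm-cr-dp-review-list">
  <div>
    <span data-hook="review-body">
      Great product! See <a href="https://example.com">this link</a> for details.
    </span>
  </div>
</div>
"""

doc = html.fromstring(sample)

# //text() selects every descendant text node; the predicate drops any
# text node that has an <a> element among its ancestors
parts = doc.xpath('//*[@data-hook="review-body"]//text()[not(ancestor::a)]')

# strip whitespace-only fragments, as in the original loop
clean = " ".join(p.strip() for p in parts if p.strip())
print(clean)
```

In the Scrapy shell this would become response.xpath('.//*[@data-hook="review-body"]//text()[not(ancestor::a)]').extract() inside the existing loop.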

  • Please include your imports with your code so that people can see what libs you're using. – kpie Feb 15 '20 at 17:36
  • @kpie if you mean the getUrl I am basically requesting the page and pointing the scrapy selector to it (I didn't include it because it is a long function) –  Feb 15 '20 at 17:43
  • You might consider including a sample of the HTML/XML that matches/doesn't match. Many people, including myself, will not go to the effort of fetching the Amazon link and going through the HTML code to see what is happening. – Sorin Feb 15 '20 at 17:54
  • It doesn't. When I say a sample, I mean an actual copy of the relevant HTML, not a link to ~512k of mostly irrelevant HTML. – Sorin Feb 15 '20 at 18:04
  • Also, as @kpie pointed out, you need to include the relevant imports in your code. Since the HTML from Amazon errors out with XPath from Perl and xmllint, and your code is incomplete, I can't even start to reproduce your issue. – Sorin Feb 15 '20 at 18:15
  • OK, you're right, shame on me for actually trying to help you. Good luck! – Sorin Feb 15 '20 at 18:20

0 Answers