
I am using Scrapy for web scraping. I can grab all the elements, but my goal is to get only the names that have more than 50 reviews. I don't know what I am missing.

import scrapy


class TripadSpider(scrapy.Spider):
    name = 'tripad'
    allowed_domains = ['www.tripadvisor.in']
    start_urls = ['https://www.tripadvisor.in/Restaurants-g304554-c33-Mumbai_Maharashtra.html']
    first = 'https://www.tripadvisor.in/'

    def parse(self, response):
        for i in response.xpath("//div[@class='_2Q7zqOgW Vt o']"):
            rating = str(i.xpath(".//span[@class='w726Ki5B']/text()").get())

            if rating >= '50':
                title = i.xpath(".//a[@class='_15_ydu6b S5 H4 Cj b']/text()").getall()
                yield {
                    'title':title,
                    'rating':rating
                }
            elif rating == 'None':
                continue

        next_page = response.xpath("//a[@class='nav next rndBtn ui_button primary taLnk']/@href").get()
        if next_page:
            sequence = (self.first,next_page)
            nexturl = ''.join(sequence)
            yield scrapy.Request(url=nexturl,callback=self.parse)

Can somebody assist me?

  • Convert `rating` to `int`: `if int(rating) >= 50:` – sittsering Sep 02 '21 at 04:40
  • I am getting this error: ValueError: invalid literal for int() with base 10: '1,076' – Iswar Chand Sep 02 '21 at 04:42
  • @sittsering The string comparison will still run, because on strings it compares lexicographic order – sb_ Sep 02 '21 at 04:42
  • This script works for a few pages, then starts misbehaving – Iswar Chand Sep 02 '21 at 04:44
  • So your ValueError shows the rating is 1,076. When comparing '1,076' >= '50', it will not return true, because the strings are compared character by character and '1' sorts before '5'. You will need to parse the rating and treat it as an integer – sb_ Sep 02 '21 at 04:48
  • You can't rely on less-than or greater-than comparisons for these string values, because strings are compared character by character rather than numerically. Convert the rating to int, like `if int(rating) >= 50:` – Dev Sep 02 '21 at 04:48
  • OK, then how do I remove the comma? Any idea? – Iswar Chand Sep 02 '21 at 04:49
  • Try ```int(float(rating))```. Then you will have to change your if statement to compare integers – sb_ Sep 02 '21 at 04:50
  • With `if int(float(rating)) >= 50` I get ValueError: could not convert string to float: '1,076' – Iswar Chand Sep 02 '21 at 04:51
  • @Dev Then what approach should I apply? – Iswar Chand Sep 02 '21 at 04:53
  • You should convert the rating value to an integer, like `if int(rating) >= 50:` – Dev Sep 02 '21 at 04:54
  • Try this in your Python console: ```rating = ''.join('1,076'.split(','))``` Here you split the value 1,076 into a list ['1', '076'], then join it back with no separator. Values without a comma pass through unchanged, but a missing rating will still fail to convert, so maybe use a try/except when setting the rating. Then you just have to parse this string as an integer! – sb_ Sep 02 '21 at 04:55
  • Thanks @sittsering, it is working using this: https://stackoverflow.com/questions/5188792/how-to-check-a-string-for-specific-characters – Iswar Chand Sep 02 '21 at 05:14
  • @sb_ That won't do, i.e. "101" is less than "50" by string comparison. – sittsering Sep 02 '21 at 05:15
  • I have done it by converting it to int – Iswar Chand Sep 02 '21 at 05:17
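
To make the point from the comments concrete, here is a small sketch of why the string comparison misbehaves and what changes once the value is an integer. The helper name `to_int` is made up for illustration, and it assumes the comma is a thousands separator, as the '1,076' in the error message suggests.

```python
# Strings are compared character by character (lexicographically),
# so the result depends on the first differing character, not the number.
print('101' >= '50')    # '1' sorts before '5'
print('1,076' >= '50')  # same problem, the comma is never even reached
print('9' >= '50')      # '9' sorts after '5', so a single digit "beats" 50


def to_int(text):
    """Hypothetical helper: drop thousands separators, then convert to int."""
    return int(text.replace(',', ''))


# With integers, the comparison is numeric.
print(to_int('1,076') >= 50)
print(to_int('9') >= 50)
```

This would also explain why the spider seemed to work for a few pages: any count starting with a digit from 5 to 9 passes the string test, whatever its real size.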

1 Answer


Remove the comma before converting to int

# rating is the string 'None' when the span is missing (str() of None), so skip those rows
if rating == 'None':
    continue
# The comma is a thousands separator ('1,076' means 1076), so strip it before converting
rating_value = int(rating.replace(',', ''))
if rating_value >= 50:
    title = i.xpath(".//a[@class='_15_ydu6b S5 H4 Cj b']/text()").getall()
    yield {
        'title': title,
        'rating': rating
    }
Dev
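
If the scraped text might carry extra words or be missing entirely, a slightly more defensive version could look like the sketch below. `parse_review_count` is a hypothetical helper, not part of Scrapy. It assumes you pass it the raw result of `.get()` (so a missing element is `None`, not the string 'None') and that commas are thousands separators.

```python
import re


def parse_review_count(raw):
    """Return the first number found in raw as an int, or None if there is none.

    Hypothetical helper: assumes commas are thousands separators.
    """
    if raw is None:
        return None
    # A digit followed by any run of digits and commas, e.g. '1,076'
    match = re.search(r'\d[\d,]*', raw)
    if match is None:
        return None
    return int(match.group().replace(',', ''))


# Example inputs; the '50 reviews' form is an assumption about the page text.
for raw in ['1,076', '49', '50 reviews', None, '']:
    count = parse_review_count(raw)
    if count is not None and count >= 50:
        print(repr(raw), '->', count)
```

Inside `parse`, this would replace the `str(...)` wrapping and the string comparison: call the helper on `i.xpath(...).get()` and yield only when the returned count is not `None` and at least 50.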