TripAdvisor scraping Python script is exporting multiple, different versions of rows

Question

I am working on this scraping script for an academic research paper. I am definitely a newbie, self-taught, and have cobbled this together!

What I want: a csv with approximately 560 rows; one column per date (mdyyyy), review, rating, and username (username is not presently accounted for in the script, FYI).

I have gotten it to run without error, but the output is wrong. I get thousands of rows. The script is looping and outputting data in multiple formats: 1) 500ish rows with month/date and review 2) 500ish rows with rating and review 3) 500ish rows with name, date, review all in the same column .... and so on.

I've spent a few hours trying to fix this problem, and now I have another:

Traceback (most recent call last):
  line 49, in
    date = " ".join(date[j].text.split(" ")[-2:])
IndexError: list index out of range

Running this in 3.9.6, if that makes a difference.

I have three questions:

1. How do I fix this "list index out of range" error on the date?
2. Is there anything glaringly wrong with the script that is causing it to create thousands of rows with different formats?
3. How can I add the username? I have tried to do so and can't seem to find the correct XPath.

Here is the website I am scraping: https://www.tripadvisor.com/ShowUserReviews-g189447-d207187-r773649540-Monastery_of_St_John-Patmos_Dodecanese_South_Aegean.html

import csv
import sys  # needed for the sys.argv check below
import time
from selenium import webdriver

# default path to file to store data
path_to_file = "D:\Documents\Archaeology\Projects\Patmos\scraped\monastery6.csv"

# default number of scraped pages
num_page = 5

# default tripadvisor website of hotel or things to do (attraction/monument) 
url = "https://www.tripadvisor.com/ShowUserReviews-g189447-d207187-r773649540-Monastery_of_St_John-Patmos_Dodecanese_South_Aegean.html"
#url = "https://www.tripadvisor.com/ShowUserReviews-g189447-d207187-r773649540-Monastery_of_St_John-Patmos_Dodecanese_South_Aegean.html"

# if you pass the inputs in the command line
if (len(sys.argv) == 4):
    path_to_file = sys.argv[1]
    num_page = int(sys.argv[2])
    url = sys.argv[3]

# import the webdrive -- NMS 20210705
driver = webdriver.Chrome("C:/Users/nsusm/AppData/Local/Programs/Python/Python39/webdriver/bin/chromedriver.exe")
driver.get("https://www.tripadvisor.com/ShowUserReviews-g189447-d207187-r773649540-Monastery_of_St_John-Patmos_Dodecanese_South_Aegean.html")

# open the file to save the review
csvFile = open(path_to_file, 'a', encoding="utf-8")
csvWriter = csv.writer(csvFile, delimiter=',')
csvWriter.writerow([str ('title'), str ('rating'), str ('review'), str ('date')])

# change the value inside the range to save more or less reviews
for i in range(0, 48, 1):

    # expand the review 
    time.sleep(2)

# define container (this is the whole box of the Trip Advisor review, excluding the date of the review)
    container = driver.find_elements_by_xpath(".//div[@class='review-container']")
    
# grab also the date of review
    date = driver.find_elements_by_xpath(".//class[@class='prw_reviews_stay_date_hsx']")

    for j in range(len(container)):

        rating = container[j].find_element_by_xpath(".//span[contains(@class, 'ui_bubble_rating bubble_')]").get_attribute("class").split("_")[3]
        title = container[j].find_element_by_xpath(".//div[contains(@class, noQuotes)]").text.replace("\n", "  ")
        review = container[j].find_element_by_xpath(".//p[@class='partial_entry']").text.replace("\n", "  ")
        date = " ".join(date[j].text.split(" ")[-2:])
                                                  
#write data into csv
        csvWriter.writerow([title, rating, review, date])
        
# change the page            
    driver.find_element_by_xpath('.//a[@class="nav next ui_button primary"]').click()

# quit selenium
driver.quit()

# FYI you need to close all windows for the file to write

score 0 · Answer 1 · answered Jul 09 '21 at 08:58

That date finder was coming back empty, so [j] failed on it. The review date is within the container, so you can get it along with everything else.

    rating = container[j].find_element_by_xpath(".//span[contains(@class, 'ui_bubble_rating bubble_')]").get_attribute("class").split("_")[3]
    person = container[j].find_element_by_class_name('info_text').text.split("\n")[0]  # person but not place
    title = container[j].find_element_by_css_selector('span.noQuotes').text.replace("\n", "  ")
    review = container[j].find_element_by_xpath(".//p[@class='partial_entry']").text.replace("\n", "  ")
    review_date = container[j].find_element_by_class_name('ratingDate').text[9:]

Changes: I select just the span for the title, not the entire div; I added code to find the person (stripping off the place on the second line); and I found the date within the container and removed the leading "Reviewed ".
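The string handling in the snippet above can be checked without opening a browser. Here is a minimal sketch that applies the same split and slice operations to sample strings. The helper names (rating_from_class, first_line, strip_reviewed) and the sample values are hypothetical, made up for illustration rather than taken from the live page.

```python
def rating_from_class(class_attr):
    """Return the bubble score, e.g. '50' from 'ui_bubble_rating bubble_50'."""
    # Splitting on "_" gives ['ui', 'bubble', 'rating bubble', '50'].
    return class_attr.split("_")[3]


def first_line(info_text):
    """Keep the reviewer name and drop the place on the second line."""
    return info_text.split("\n")[0]


def strip_reviewed(rating_date):
    """Drop the 9-character prefix 'Reviewed ' from the date text."""
    return rating_date[9:]


# Sample values (assumed, for illustration only)
print(rating_from_class("ui_bubble_rating bubble_50"))
print(first_line("SampleUser\nAthens, Greece"))
print(strip_reviewed("Reviewed July 5, 2021"))
```

If the page ever changes the "Reviewed " prefix, the fixed [9:] slice would cut in the wrong place, so it is worth spot-checking a few rows of the output.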

answered Jul 09 '21 at 08:58

Jeremy Kahan

Jeremy, you're a life saver! If you have time, would you mind explaining to me why (when the script would run) the csv would have thousands of rows and have the data formatted in so many different ways? Is it because I mistakenly had a date finder and the script was looping through both that and the container finder? – Natalie Susmann Jul 09 '21 at 15:09
Also, any idea why there are blank lines in between each entry? I tried changing open(path_to_file, 'w') to open(path_to_file, 'a') and that didn't work. – Natalie Susmann Jul 09 '21 at 16:03
Maybe. I did notice a carriage return at the end of the date, so after my [9:] you may want to add .replace('\n', ''). Your theory about extra lines seems plausible. But I think having div not span for the title meant you were getting all sorts of extra stuff in unpredictable formats. – Jeremy Kahan Jul 10 '21 at 19:17
Correction: the blank lines have nothing to do with a trailing return on the date. You need to add newline='' as a third parameter to your call to open(). See https://stackoverflow.com/questions/3348460/csv-file-written-with-python-has-blank-lines-between-each-row – Jeremy Kahan Jul 10 '21 at 20:20
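
The newline='' fix from the last comment can be sketched as follows. The helper name write_reviews, the output file name, and the sample row are hypothetical; the columns are the ones discussed in this thread.

```python
import csv
import os
import tempfile


def write_reviews(path, rows):
    """Write a header plus review rows to a CSV file."""
    # newline='' lets the csv module control line endings itself, which
    # avoids the blank line between rows seen on Windows.
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file, delimiter=",")
        writer.writerow(["person", "title", "rating", "review", "date"])
        writer.writerows(rows)


# Hypothetical sample data, for illustration only
sample_rows = [["SampleUser", "Worth the climb", "50", "Lovely views.", "July 5, 2021"]]
out_path = os.path.join(tempfile.gettempdir(), "reviews_sketch.csv")
write_reviews(out_path, sample_rows)
```

The with block closes the file as soon as the rows are written, so the data should reach the disk without closing any other windows. Opening with 'w' rather than 'a' also starts from an empty file on each run; with 'a', rows from earlier runs of the script stay in the file, which may explain leftover rows in older formats.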
