I am working on this scraping script for an academic research paper. I am definitely a newbie, self-taught, and have cobbled this together!
What I want: a CSV with approximately 560 rows and one column each for date (mdyyyy), review, rating, and username (the script does not yet handle username, FYI).
I have gotten it to run without error, but the output is wrong: I get thousands of rows. The script loops and writes the data in multiple formats: 1) 500ish rows with month/date and review; 2) 500ish rows with rating and review; 3) 500ish rows with name, date, and review all in the same column; and so on.
I've spent a few hours trying to fix this problem, and now I have another:
    Traceback (most recent call last):
      line 49, in <module>
        date = " ".join(date[j].text.split(" ")[-2:])
    IndexError: list index out of range
Running this in Python 3.9.6, if that makes a difference.
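For reference, here is a minimal sketch of what most likely triggers this error. It assumes the date XPath in the script (`.//class[...]`) matches nothing, since HTML has no element named `class`, so `date` is an empty list and `date[j]` fails on the first index. The loop also rebinds `date` from a list to a string, which would break the second iteration even if the XPath matched. The function names and sample strings are hypothetical stand-ins for the Selenium element text, not part of the original script.

```python
def last_two_words(text):
    """Return the last two space-separated words, e.g. the month and year."""
    return " ".join(text.split(" ")[-2:])


def pair_reviews_with_dates(reviews, date_texts):
    """Pair each review with its date text.

    zip() stops at the shorter list, so a missing date yields fewer rows
    instead of raising IndexError. The raw list is never overwritten.
    """
    return [(review, last_two_words(raw)) for review, raw in zip(reviews, date_texts)]


# Hypothetical sample data standing in for Selenium element text.
reviews = ["Lovely monastery", "Worth the climb"]
date_texts = ["Date of experience: October 2020", "Date of experience: July 2019"]

rows = pair_reviews_with_dates(reviews, date_texts)
# If the date XPath matched nothing, there are simply no rows, and no crash.
empty = pair_reviews_with_dates(reviews, [])
```

Using separate names for the list of elements and the per-review string (here `date_texts` and the value returned by `last_two_words`) avoids the rebinding problem. Comparing `len(container)` with the number of dates found before looping would also show quickly whether the XPath matched anything.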
I have three questions:
1. How do I fix this date out of range issue?
2. Is there anything glaringly wrong with the script that is causing it to create thousands of rows with different formats?
3. How can I add the username in? I have tried to do so and can't seem to find the correct xpath.

Here is the website I am scraping: https://www.tripadvisor.com/ShowUserReviews-g189447-d207187-r773649540-Monastery_of_St_John-Patmos_Dodecanese_South_Aegean.html
```
import csv
import sys
import time

from selenium import webdriver

# default path to file to store data
path_to_file = r"D:\Documents\Archaeology\Projects\Patmos\scraped\monastery6.csv"
# default number of scraped pages
num_page = 5
# default tripadvisor website of hotel or things to do (attraction/monument)
url = "https://www.tripadvisor.com/ShowUserReviews-g189447-d207187-r773649540-Monastery_of_St_John-Patmos_Dodecanese_South_Aegean.html"

# if you pass the inputs in the command line
if len(sys.argv) == 4:
    path_to_file = sys.argv[1]
    num_page = int(sys.argv[2])
    url = sys.argv[3]

# import the webdriver -- NMS 20210705
driver = webdriver.Chrome("C:/Users/nsusm/AppData/Local/Programs/Python/Python39/webdriver/bin/chromedriver.exe")
driver.get(url)

# open the file to save the review
csvFile = open(path_to_file, 'a', encoding="utf-8")
csvWriter = csv.writer(csvFile, delimiter=',')
csvWriter.writerow(['title', 'rating', 'review', 'date'])

# change the value inside the range to save more or less reviews
for i in range(0, 48, 1):
    # expand the review
    time.sleep(2)
    # define container (this is the whole box of the Trip Advisor review, excluding the date of the review)
    container = driver.find_elements_by_xpath(".//div[@class='review-container']")
    # grab also the date of review
    date = driver.find_elements_by_xpath(".//class[@class='prw_reviews_stay_date_hsx']")

    for j in range(len(container)):
        rating = container[j].find_element_by_xpath(".//span[contains(@class, 'ui_bubble_rating bubble_')]").get_attribute("class").split("_")[3]
        title = container[j].find_element_by_xpath(".//div[contains(@class, noQuotes)]").text.replace("\n", " ")
        review = container[j].find_element_by_xpath(".//p[@class='partial_entry']").text.replace("\n", " ")
        date = " ".join(date[j].text.split(" ")[-2:])
        # write data into csv
        csvWriter.writerow([title, rating, review, date])

    # change the page
    driver.find_element_by_xpath('.//a[@class="nav next ui_button primary"]').click()

# quit selenium
driver.quit()
# FYI you need to close all windows for the file to write
```
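
One possible contributor to the thousands of mixed-format rows: the script opens the CSV in append mode (`'a'`), so every run adds another header and another batch of rows to the same file, including rows from earlier versions of the script with different column layouts. A second suspect is `contains(@class, noQuotes)`: without quotes, XPath reads `noQuotes` as a child element name rather than a string, which probably makes the test match the first `div` in the container and pull name, date, and review into one cell. Below is a minimal sketch of the file-handling side only, assuming it is acceptable to overwrite the file on each run. The `write_reviews` and `count_lines` helpers, the demo path, and the sample rows are hypothetical.

```python
import csv
import os
import tempfile

HEADER = ["title", "rating", "review", "date"]


def write_reviews(path, rows):
    """Overwrite path with one header row followed by the review rows."""
    # "w" truncates the file on each run; newline="" stops the csv module
    # from producing blank lines between rows on Windows.
    with open(path, "w", encoding="utf-8", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(HEADER)
        writer.writerows(rows)
    # Leaving the with-block closes the file, so the data is flushed to disk.


def count_lines(path):
    """Count the CSV records in path, header included."""
    with open(path, encoding="utf-8", newline="") as csv_file:
        return sum(1 for _ in csv.reader(csv_file))


# Hypothetical rows standing in for scraped data.
sample_rows = [
    ["Lovely monastery", "50", "Beautiful frescoes.", "October 2020"],
    ["Worth the climb", "40", "Go early.", "July 2019"],
]

demo_path = os.path.join(tempfile.mkdtemp(), "reviews_demo.csv")
write_reviews(demo_path, sample_rows)
write_reviews(demo_path, sample_rows)  # a second run replaces the file, it does not append
line_count = count_lines(demo_path)
```

Collecting all rows in a list during scraping and writing them once at the end keeps the output to a single header and a single format. The `with` block also closes the file explicitly, which may remove the need to close other windows before the data appears.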