I am very new to Python and I am trying to scrape Twitter with the help of Selenium (see the code below). I have a list of website URLs saved in a CSV, and the code should go through those websites one by one, scroll through each page, and scrape specific information from it. Ideally, everything that is scraped should be saved to a single CSV at the end. I was able to get the Selenium part and the looping part of my code to work separately, but I cannot get them to work together: I always end up with an empty CSV.
Can someone please help? I would really appreciate it if someone could look at my code below!
#Do imports
import csv
import time
import selenium
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome(executable_path=r"/chromedriver")

tweets = []
with open('BKQuotedTweetsURL.csv', 'rt') as BK_csv:
    BK_url = csv.reader(BK_csv)
    for row in BK_url:
        links = row[0]
        tweets.append(links)

#link should be something like "https://.com"
for link in tweets:
    driver.get(link)
    time.sleep(10)

    # Get scroll height after the first page load
    last_height = driver.execute_script("return document.body.scrollHeight")
    last_elem = ''
    current_elem = ''

    while True:
        # Scroll down to the bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait for the page to load
        time.sleep(5)
        # Calculate the new scroll height and compare it with the last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    #update all_tweets to keep the loop going
    all_tweets = driver.find_elements(By.XPATH, '//div[@data-testid]//article[@data-testid="tweet"]')
    for item in all_tweets[1:]:  # skip the tweet that was already scraped
        print('--- date ---')
        try:
            date = item.find_element(By.XPATH, './/time').text
        except:
            date = '[empty]'
        print(date)
        print('--- text ---')
        try:
            text = item.find_element(By.XPATH, './/div[@data-testid="tweetText"]').text
        except:
            text = '[empty]'
        print(text)
        print('--- replying_to ---')
        try:
            replying_to = item.find_element(By.XPATH, './/div[contains(text(), "Replying to")]//a').text
        except:
            replying_to = '[empty]'
        print(replying_to)
        #Append the new tweet replies to the tweets list
        tweets.append([replying_to, text, date])

    if last_elem == current_elem:
        result = True
    else:
        last_elem = current_elem

df = pd.DataFrame(tweets, columns=['Replying to', 'Tweet', 'Date of Tweet'])
df.to_csv(r'BKURLListComm.csv', index=False, encoding='utf-8')  #save a CSV file in the downloads folder; change the path to your structure and desired folder
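To make it clearer what I am trying to achieve, here is a stripped-down sketch of the flow I have in mind (the Selenium part is left out, and I use a separate scraped_rows list here only to show the intended input and output; the file names are the ones from my code above):

#Simplified sketch of the intended flow (not my actual code)
import csv
import pandas as pd

#Step 1: read the URLs from the input csv into one list
urls = []
with open('BKQuotedTweetsURL.csv', 'rt') as BK_csv:
    for row in csv.reader(BK_csv):
        urls.append(row[0])

#Step 2: visit every URL and collect the scraped rows in a second list
scraped_rows = []
for link in urls:
    #...open the page with Selenium, scroll, find the tweet elements...
    #for every tweet found: scraped_rows.append([replying_to, text, date])
    pass

#Step 3: write everything to the output csv once, at the very end
df = pd.DataFrame(scraped_rows, columns=['Replying to', 'Tweet', 'Date of Tweet'])
df.to_csv(r'BKURLListComm.csv', index=False, encoding='utf-8')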
I think something might be wrong with the looping, but I am not sure; I have tried a lot of different things I found in other questions and on other websites, but nothing helped.
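One more thing, in case it is relevant: I have read that executable_path is deprecated in Selenium 4 and that the driver should be started with a Service object instead, roughly like this (the chromedriver path is just my local one, adjust it to your setup):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

#Selenium 4 style: pass the chromedriver path through a Service object
service = Service(r"/chromedriver")
driver = webdriver.Chrome(service=service)

I am not sure whether this is related to the empty CSV, but I wanted to mention it.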