
I am very new to Python and I am trying to scrape Twitter with the help of Selenium (see code below). I have a list of websites saved in a csv, and the code I wrote should go through those websites one by one, scroll through each of them, and scrape specific information from every page. Ideally, all of the scraped information would be saved to a csv at the end. I was able to get the Selenium part of my code and the looping part of my code to work separately, but I cannot get them to work together: I want to save everything scraped from all the websites (URLs) in one csv, but I always end up with an empty csv.

Can someone please help me with my code below? I would really appreciate it!

#Do imports
import csv 
import time
import selenium
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait 
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome(executable_path=r"/chromedriver")

tweets = []

with open('BKQuotedTweetsURL.csv', 'rt') as BK_csv:
    BK_url = csv.reader(BK_csv)
    for row in BK_url:
        links = row[0]
        tweets.append(links)

#link should be something like "https://.com"
for link in tweets:
    driver.get(link)
    time.sleep(10)
            
    # Get scroll height after first time page load
    last_height = driver.execute_script("return document.body.scrollHeight")

    last_elem=''
    current_elem=''

    while True:
            
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait to load page
        time.sleep(5)
        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
           break
        last_height = new_height
            
            
        #update all_tweets to keep loop
        all_tweets = driver.find_elements(By.XPATH, '//div[@data-testid]//article[@data-testid="tweet"]')

        for item in all_tweets[1:]: # skip tweets already scraped

            print('--- date ---')
            try:
                date = item.find_element(By.XPATH, './/time').text
            except:
                date = '[empty]'
            print(date)

            print('--- text ---')
            try:
                text = item.find_element(By.XPATH, './/div[@data-testid="tweetText"]').text
            except:
                text = '[empty]'
            print(text)
            
            print('--- replying_to ---')
            try:
                replying_to = item.find_element(By.XPATH, './/div[contains(text(), "Replying to")]//a').text
            except:
                replying_to = '[empty]'
            print(replying_to)
            
            # Append the new tweet reply to the tweets list
            tweets.append([replying_to, text, date])
                       
            if (last_elem == current_elem):
                result = True
            else:
                last_elem = current_elem


df = pd.DataFrame(tweets, columns=['Replying to', 'Tweet', 'Date of Tweet'])
df.to_csv(r'BKURLListComm.csv', index=False, encoding='utf-8') # save the csv; change the path to your desired folder

I think something might be wrong with the looping, but I am not sure; I have tried a lot of different things I found on other websites and questions, but nothing helped.

Zabina
  • One issue is that the executable_path keyword is deprecated. See this popular Stack Overflow answer: https://stackoverflow.com/questions/69918148/deprecationwarning-executable-path-has-been-deprecated-please-pass-in-a-servic – Gray Dec 18 '22 at 22:58
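
For reference, with Selenium 4 the driver is started by passing a Service object instead of executable_path; a minimal sketch (the driver path below is a placeholder, adjust it to your setup):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Pass the driver location via a Service object instead of the
# deprecated executable_path keyword (placeholder path).
service = Service("/path/to/chromedriver")
driver = webdriver.Chrome(service=service)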

1 Answer


It looks like you are appending the newly scraped data to the tweets list, but that same list is also being iterated by the outer for loop. Each append mutates the list mid-iteration, so the loop keeps growing and the loop variable link eventually takes on your scraped rows instead of URLs, which makes the loop behave unexpectedly.
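
You can see the problem in isolation with a toy example (hypothetical values):

urls = ["url1", "url2"]
for link in urls:
    urls.append(["reply", "text", "date"])  # mutating the list we are iterating
    print(link)                             # eventually prints the appended rows
    if len(urls) > 5:                       # guard so the demo terminates
        break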

To fix this, you can use a separate list to store the scraped data. For example:

scraped_data = []

for link in tweets:
    driver.get(link)
    time.sleep(10)
    ...
    for item in all_tweets[1:]:
        ...
        scraped_data.append([replying_to, text, date])

df = pd.DataFrame(scraped_data, columns=['Replying to', 'Tweet', 'Date of Tweet'])
df.to_csv(r'BKURLListComm.csv', index=False, encoding='utf-8')
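
Putting it together with the scrolling logic from your question, the whole fixed loop might look like the sketch below (it reuses your XPaths and file names; the chromedriver path is a placeholder):

import csv
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service("/path/to/chromedriver"))  # placeholder path

# Read the list of tweet URLs from the input csv.
with open('BKQuotedTweetsURL.csv', 'rt') as BK_csv:
    tweet_urls = [row[0] for row in csv.reader(BK_csv)]

scraped_data = []  # scraped rows, kept separate from the URL list

for link in tweet_urls:
    driver.get(link)
    time.sleep(10)

    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scrape whatever is currently loaded before scrolling further,
        # so the final batch is not skipped when the loop breaks.
        all_tweets = driver.find_elements(
            By.XPATH, '//div[@data-testid]//article[@data-testid="tweet"]')

        for item in all_tweets[1:]:  # skip the first (original) tweet
            try:
                date = item.find_element(By.XPATH, './/time').text
            except Exception:
                date = '[empty]'
            try:
                text = item.find_element(By.XPATH, './/div[@data-testid="tweetText"]').text
            except Exception:
                text = '[empty]'
            try:
                replying_to = item.find_element(By.XPATH, './/div[contains(text(), "Replying to")]//a').text
            except Exception:
                replying_to = '[empty]'
            scraped_data.append([replying_to, text, date])

        # Scroll down and stop once the page height no longer grows.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(5)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

df = pd.DataFrame(scraped_data, columns=['Replying to', 'Tweet', 'Date of Tweet'])
df.to_csv(r'BKURLListComm.csv', index=False, encoding='utf-8')

Note that this sketch can collect the same tweet more than once across scroll passes; calling df.drop_duplicates() before saving is one simple way to clean that up.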