Goal: I am writing a Python program to scrape data on posts in a private Facebook group. I use Selenium with the Chrome WebDriver to log in to my Facebook account (which is a member of the group) and to scroll down the page so that posts keep loading until a certain date is reached. I then use Beautiful Soup to scrape each post's content, username, number of likes, comments, date, etc.
Issue: After scrolling and scraping for a few minutes, I get an out-of-memory error. I figured the page keeps adding more post div elements as new posts load, eventually taking up too much memory, so I execute a script that removes each post element after I've scraped its data. I still get the out-of-memory error, but the removal does appear to work, because I can see that earlier posts are missing from the top of the feed. How do I continuously scroll and scrape without running out of memory? My condensed code is below. Toward the end of the code, I try to check that the element I am about to remove is the one that was just scraped, but that check isn't working.
from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://www.facebook.com")
driver.maximize_window()
# very short timeout so each scroll pass only checks briefly for the target date
wait = WebDriverWait(driver, 0.1)
sleep(2)
# login to FB using phone number and password and navigate to desired FB group
username_list = []
# scroll and load all posts up to the desired date
# (an out-of-memory error is raised after scrolling for a while, since post div elements are constantly being added to the page)
desired_date = None
while not desired_date:
    sleep(2)
    # scroll to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    # scroll up a little to allow more posts to load
    driver.execute_script("window.scrollBy(0, -5)")
    try:
        # stop scrolling once a post containing the target date text has loaded
        # ('December 12' below is a placeholder for the actual target date)
        desired_date = wait.until(EC.presence_of_element_located((By.XPATH, "//*[contains(text(), 'December 12')]")))
    except TimeoutException:
        pass
    # re-parse the page source each pass so newly loaded posts are included
    soup = BeautifulSoup(driver.page_source, "html.parser")
    # find_all() returns Beautiful Soup Tag objects
    all_posts = soup.find_all("div", {"class": "x1yztbdb x1n2onr6 xh8yej3 x1ja2u2z"})
    for post in all_posts:
        # username
        try:
            name = post.find("a", {"class": "x1i10hfl xjbqb8w x6umtig x1b1mbwd xaqea5y xav7gou x9f619 x1ypdohk xt0psk2 xe8uvvx xdj266r x11i5rnm xat24cr x1mh8g0r xexx8yu x4uap5 x18d9i69 xkhd6sd x16tdsg8 x1hl2dhg xggy1nq x1a2a7pz xt0b8zv xzsf02u x1s688f"}).get_text()
        except AttributeError:
            name = "Anonymous"
        # add each data element to the dataframe
        username_list.append(name)
        # remove the post element from the page to free up memory once the desired data is collected
        # find_element() returns the first DOM element matching the selector (every post element has the same class name)
        # cannot find the element by CLASS_NAME because of the spaces in the class name, so I switch to a CSS selector locator
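        # (my understanding: By.CLASS_NAME expects a single class name, so passing the
        # space-separated list does not match the post element; joining the classes with
        # dots in a CSS selector targets the same element)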
        element = driver.find_element(By.CSS_SELECTOR, "div.x1yztbdb.x1n2onr6.xh8yej3.x1ja2u2z")
        # check that the element to be removed is the post that was just scraped
        # discard_element = element.find_element(By.CSS_SELECTOR, ".x1i10hfl.xjbqb8w.x6umtig.x1b1mbwd.xaqea5y.xav7gou.x9f619.x1ypdohk.xt0psk2.xe8uvvx.xdj266r.x11i5rnm.xat24cr.x1mh8g0r.xexx8yu.x4uap5.x18d9i69.xkhd6sd.x16tdsg8.x1hl2dhg.xggy1nq.x1a2a7pz.xt0b8zv.xzsf02u.x1s688f").text
        if True:  # ideally `if discard_element == name:`, but changed to True for now
            # arguments[0] refers to the element passed as the second argument to execute_script()
            driver.execute_script("""
                var element = arguments[0];
                element.remove();""", element)
# print(f"{discard_element} removed")
else:
print(f"Current post: {name}")
print(f"Post to be removed: {discard_element}")