Goal: I am writing a Python program to scrape data on posts in a private Facebook group. I use Selenium with the Chrome WebDriver to log in to my Facebook account (which is a member of the group) and to scroll down the page so that posts keep loading until a certain date is reached. I then use Beautiful Soup to scrape each post's content, username, number of likes, comments, date, etc.
Issue: After scrolling and scraping for a few minutes, I get an out-of-memory error. I figured the page keeps adding more post div elements as new posts load, eventually taking up too much memory, so I execute a script that removes each post element after I've scraped its data. I still get the out-of-memory error, but the removal does appear to work, because I can see that earlier posts are missing from the top of the feed. How do I continuously scroll and scrape without running out of memory? My condensed code is below. Toward the end of the code, I try to check that the element I am about to remove is the one that was just scraped, but that check isn't working.
from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
driver.get("https://www.facebook.com")
driver.maximize_window()
# very short timeout so each scroll pass only checks briefly for the target date
wait = WebDriverWait(driver, 0.1)
sleep(2)
# login to FB using phone number and password and navigate to desired FB group
username_list = []
# scroll and load all posts up to the desired date
# (an out-of-memory error is raised after scrolling for a while, since post div elements are constantly being added to the page)
desired_date = None
while not desired_date:
    sleep(2)
    # scroll to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    # scroll up a little to allow more posts to load
    driver.execute_script("window.scrollBy(0, -5)")
    try:
        # stop scrolling once a post containing the target date text has loaded
        # ('December 12' below is a placeholder for the actual target date)
        desired_date = wait.until(EC.presence_of_element_located((By.XPATH, "//*[contains(text(), 'December 12')]")))
    except TimeoutException:
        pass
    # re-parse the page source each pass so newly loaded posts are included
    soup = BeautifulSoup(driver.page_source, "html.parser")
    # find_all() returns Beautiful Soup Tag objects
    all_posts = soup.find_all("div", {"class": "x1yztbdb x1n2onr6 xh8yej3 x1ja2u2z"})
    for post in all_posts:
        # username
        try:
            name = post.find("a", {"class": "x1i10hfl xjbqb8w x6umtig x1b1mbwd xaqea5y xav7gou x9f619 x1ypdohk xt0psk2 xe8uvvx xdj266r x11i5rnm xat24cr x1mh8g0r xexx8yu x4uap5 x18d9i69 xkhd6sd x16tdsg8 x1hl2dhg xggy1nq x1a2a7pz xt0b8zv xzsf02u x1s688f"}).get_text()
        except AttributeError:
            name = "Anonymous"
        # add each data element to the dataframe
        username_list.append(name)
        # remove the post element from the page to free up memory once the desired data is collected
        # find_element() returns the first DOM element matching the selector (every post element has the same class name)
        # cannot find the element by CLASS_NAME because of the spaces in the class name, so I switch to a CSS selector locator
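        # (my understanding: By.CLASS_NAME expects a single class name, so passing the
        # space-separated list does not match the post element; joining the classes with
        # dots in a CSS selector targets the same element)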
        element = driver.find_element(By.CSS_SELECTOR, "div.x1yztbdb.x1n2onr6.xh8yej3.x1ja2u2z")
        # check that the element to be removed is the post that was just scraped
        # discard_element = element.find_element(By.CSS_SELECTOR, ".x1i10hfl.xjbqb8w.x6umtig.x1b1mbwd.xaqea5y.xav7gou.x9f619.x1ypdohk.xt0psk2.xe8uvvx.xdj266r.x11i5rnm.xat24cr.x1mh8g0r.xexx8yu.x4uap5.x18d9i69.xkhd6sd.x16tdsg8.x1hl2dhg.xggy1nq.x1a2a7pz.xt0b8zv.xzsf02u.x1s688f").text
        if True:  # ideally `if discard_element == name:`, but changed to True for now
            # arguments[0] refers to the element passed as the second argument to execute_script()
            driver.execute_script("""
                var element = arguments[0];
                element.remove();""", element)
# print(f"{discard_element} removed")
else:
print(f"Current post: {name}")
print(f"Post to be removed: {discard_element}")