
I have this script to download images from Instagram. The only issue I'm having is that as Selenium scrolls down toward the bottom of the webpage, BeautifulSoup starts grabbing the same `img` src links again on each pass of the requests loop.

Although it continues to scroll down and download pictures, by the time everything is done I end up with 2 or 3 duplicates. So my question is: is there a way to prevent this duplication from happening?

import requests
from bs4 import BeautifulSoup
import selenium.webdriver as webdriver
import time


url = ('https://www.instagram.com/kitties')
driver = webdriver.Firefox()
driver.get(url)

scroll_delay = 0.5
last_height = driver.execute_script("return document.body.scrollHeight")
counter = 0

print('[+] Downloading:\n')

def screens(get_name):
    with open("/home/cha0zz/Desktop/photos/img_{}.jpg".format(get_name), 'wb') as f:
        r = requests.get(img_url)
        f.write(r.content)

while True:

    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_delay)
    new_height = driver.execute_script("return document.body.scrollHeight")

    soup = BeautifulSoup(driver.page_source, 'lxml')
    imgs = soup.find_all('img', class_='_2di5p')
    for img in imgs:
        img_url = img["src"]
        print('=> [+] img_{}'.format(counter))
        screens(counter)
        counter = counter + 1

    if new_height == last_height:
        break
    last_height = new_height


Update: I placed this part of the code outside of the `while True` loop and let Selenium load the whole page first, in order to hopefully have bs4 scrape all the images. It only works up to image number 30 and then stops.

soup = BeautifulSoup(driver.page_source, 'lxml')
imgs = soup.find_all('img', class_='_2di5p')
for img in imgs:
    #tn = datetime.now().strftime('%H:%M:%S')
    img_url = img["src"]
    print('=> [+] img_{}'.format(counter))
    screens(counter)
    counter = counter + 1

2 Answers


The reason it only loads 30 in the second version of your script is that the rest of the elements are removed from the page DOM and are no longer part of the source that BeautifulSoup sees. The solution is to keep doing what you were doing the first time, but to remove any duplicate elements before you iterate through the list and call `screens()`. You can do this using sets as below, though I'm not sure if this is the absolute most efficient way to do it:

import requests
import selenium.webdriver as webdriver
import time

driver = webdriver.Firefox()

url = ('https://www.instagram.com/cats/?hl=en')
driver.get(url)

scroll_delay = 3
last_height = driver.execute_script("return document.body.scrollHeight")
counter = 0

print('[+] Downloading:\n')

def screens(get_name):
    # img_url is the module-level variable set in the scraping loop below
    with open("test_images/img_{}.jpg".format(get_name), 'wb') as f:
        r = requests.get(img_url)
        f.write(r.content)

old_imgs = set()

while True:

    imgs = driver.find_elements_by_class_name('_2di5p')

    # set difference keeps only the elements not seen on the previous pass
    imgs_dedupe = set(imgs) - set(old_imgs)

    for img in imgs_dedupe:
        img_url = img.get_attribute("src")
        print('=> [+] img_{}'.format(counter))
        screens(counter)
        counter = counter + 1

    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(scroll_delay)
    new_height = driver.execute_script("return document.body.scrollHeight")

    old_imgs = imgs

    if new_height == last_height:
        break
    last_height = new_height

driver.quit()

As you can see, I used a different page to test it, one with 420 images of cats. The result was 420 images, the number of posts on that account, with no duplicates among them.
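For reference, the dedup step above is plain set difference; here's a minimal, self-contained sketch of the idea (the strings are made-up stand-ins for the WebElements Selenium returns):

    # Set difference drops anything already seen on a previous pass.
    old_imgs = set()

    first_pass = ["img_a", "img_b"]
    new_items = set(first_pass) - old_imgs     # {'img_a', 'img_b'}
    old_imgs = set(first_pass)

    second_pass = ["img_a", "img_b", "img_c"]  # the page grew while scrolling
    new_items = set(second_pass) - old_imgs    # only {'img_c'} is new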

  • This is absolutely perfect! Exactly what I needed to do. Thanks a lot for that. Is there any good read or some samples for what you showed me? You said to use `set()`? – P_n May 30 '18 at 16:10
  • If you want information on sets you can look at the [Python documentation](https://docs.python.org/3.6/library/stdtypes.html?highlight=set#set-types-set-frozenset) for sets, and in particular for the [`difference()`](https://docs.python.org/3.6/library/stdtypes.html?highlight=set#frozenset.difference) method, which lets you remove duplicates in A that also appear in B by doing A - B. I came up with this answer by looking through a bunch of sources, but I didn't find anything good enough to recommend on this. – Mihai Chelaru May 30 '18 at 16:22
  • Ok, no problem. I also see that you changed from soup to `imgs = driver.find_elements_by_class_name('_2di5p')`, so basically soup doesn't have to be used at all? – P_n May 30 '18 at 16:30
  • Yes, I got rid of `BeautifulSoup` because it's not necessary, as Selenium can parse the elements just as well. No point having an extra dependency if you can avoid it. I wonder if you can go without using `requests` as well, but it seems to work just fine as it is so I see no reason to change it. – Mihai Chelaru May 30 '18 at 16:38
  • It seems to work perfectly as is. I am just going to clean up the code a little and figure out how to run the webdriver silently, so it doesn't open the browser every time. – P_n May 30 '18 at 16:45
  • See [this answer](https://stackoverflow.com/a/46768243/9374673) on running the Firefox browser in headless mode; a minimal sketch follows below. – Mihai Chelaru May 30 '18 at 16:48
  • Awesome. Now it runs silently :) Thanks so much – P_n May 30 '18 at 16:52
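For completeness, a minimal headless-Firefox setup along the lines of the answer linked above. This is a sketch, not the linked answer's exact code, and the way options are passed varies a little between Selenium versions:

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    # The "-headless" flag is passed to the Firefox binary itself,
    # so the browser runs without opening a window.
    options = Options()
    options.add_argument("-headless")

    driver = webdriver.Firefox(options=options)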

I would use the `os` library to check whether the file already exists:

import os


def screens(get_name):
    path = "/home/cha0zz/Desktop/photos/img_{}.jpg".format(get_name)
    # os.path.isfile() checks that the file exists (returns False for a directory);
    # os.path.exists() would check for a file or directory
    if os.path.isfile(path):
        return  # skip the download, this image is already saved
    with open(path, 'wb') as f:
        r = requests.get(img_url)
        f.write(r.content)

*Note that the existence check has to happen before the `with open()` statement, since opening the file with `'wb'` truncates it (creating an empty file) before the check could run.

  • The issue is that it's fetching the same URL while Selenium is loading the page. – P_n May 30 '18 at 03:34
  • Hm... so? You said you're having file duplicates. My solution is to check whether the file exists before saving it, in order to avoid duplicates. Please correct me if I misunderstood you. – Biarys May 30 '18 at 03:42
  • Also, regarding your update: I think it loads up to 30 because the DOM only loads that far. You might want to try scrolling all the way to the bottom of the page, and only then run `soup = BeautifulSoup(driver.page_source, 'lxml')`. @uzdisral – Biarys May 30 '18 at 03:47
  • Selenium loads the full page, but the soup won't pick up all of it. It goes to 30 max and stops. – P_n May 30 '18 at 04:01