
I'm trying to scrape my Deezer music, but when I scroll the page, Selenium skips a lot of tracks: it skips the first 30, prints 10, then skips another 30, and so on until the end of the page.

Here is the code:

import selenium
from selenium import webdriver
path   = "./chromedriver"
driver = webdriver.Chrome(executable_path=path)
url = 'https://www.deezer.com/fr/playlist/2560242784'
driver.get(url)

for i in range(0, 20):
    try:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        musics = driver.find_elements_by_class_name('BT3T6')
        for music in musics:
            print(music.text)
    except Exception as e:
        print(e)
Adrian Mole
1 Answer


I tried scraping the page based on your code and succeeded.

I decided to scroll the page by 500px per step and then remove all duplicates and empty strings.

import selenium
import time
from selenium import webdriver
path   = "./chromedriver"
driver = webdriver.Chrome(executable_path=path)
url = 'https://www.deezer.com/fr/playlist/2560242784'
driver.get(url)

all_music = []
last_scroll_y = driver.execute_script("return window.scrollY")
for i in range(0, 100):
    try:
        #first scrape
        musics = driver.find_elements_by_class_name('BT3T6')
        for music in musics:
            all_music.append(music.text)
        #then scroll down +500px
        driver.execute_script("window.scrollTo(0, window.scrollY+500);")
        time.sleep(0.2) #some wait for the new content (200ms)
        current_scroll_y = driver.execute_script("return window.scrollY")
        
        # exit the loop if the page is not scrolled any more
        if current_scroll_y == last_scroll_y:
            break
        last_scroll_y = current_scroll_y
    except Exception as e:
        print(e)

# this removes all empty strings
all_music = list(filter(None, all_music))

# this removes all duplications, but keeps the order
# based on https://stackoverflow.com/a/17016257/5226491
# python 3.7 required
all_music = list(dict.fromkeys(all_music))

# this also removes all duplications, but the order will be changed
#all_music = list(set(all_music))

for m in all_music:
    print(m)
    
print('Total music found: ' + str(len(all_music)))

This runs in about 60-90 seconds and scrapes 1000+ items.

Note: it works fine with an active window, and also in headless mode, but it stops scraping when I minimize the browser window. So either run it with the headless Chrome option:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
driver = webdriver.Chrome(CHROMEDRIVER_PATH, options=options)

or avoid minimizing the window.

Max Daroshchanka
  • You could also make it into a set to remove duplicates. – Arundeep Chohan Jan 30 '22 at 23:32
  • And out of curiosity, do you know why this skipping problem occurs with this method (`document.body.scrollHeight`)? – Antoine Croissant Jan 31 '22 at 10:03
  • @AntoineCroissant I've found that only about 20 `class_name('BT3T6')` records can be found at a time. When new items are loaded, the previous items become inaccessible. So scrolling by `document.body.scrollHeight` loads too many new items at once; the scroll step is too big. – Max Daroshchanka Jan 31 '22 at 10:19
  • OK, thank you. And one last thing, if you know: is there any way to remove duplicates other than by matching names? It creates confusion, because some tracks share the same title, and there is no ID for the tracks, which makes things more complex. Thanks in advance – Antoine Croissant Jan 31 '22 at 11:09
  • @AntoineCroissant you can collect the text for this class: `ZLI1L`, which will scrape the full table row. After removing all duplicates you can extract the song name from the row text. I see the output text is like `Réalité augmentée\nE\nNekfeu\nCyborg\n14/12/2016\n04:10`, so you might split it by `\n` and take the first item. – Max Daroshchanka Jan 31 '22 at 11:17
  • @AntoineCroissant You can accept the answer if this was helpful enough. – Max Daroshchanka Jan 31 '22 at 11:51
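Following up on the comments above, a minimal sketch of the row-splitting approach: deduplicate the full row text, then take the first newline-separated field as the song title. The `ZLI1L` class name and the row layout (`title\nexplicit flag\nartist\nalbum\ndate\nduration`) are taken from the comments and may change at any time; the sample row here is the one quoted in the thread, not live scraped data.

```python
# Hypothetical sample rows as they would come from
# driver.find_elements_by_class_name('ZLI1L') via .text;
# the layout is the one quoted in the comments above.
rows = [
    "Réalité augmentée\nE\nNekfeu\nCyborg\n14/12/2016\n04:10",
    "Réalité augmentée\nE\nNekfeu\nCyborg\n14/12/2016\n04:10",  # duplicate row
    "",  # empty string from a partially rendered row
]

# Drop empty strings, deduplicate whole rows (order-preserving),
# then split each row on '\n' and keep the first field (the title).
unique_rows = list(dict.fromkeys(filter(None, rows)))
titles = [row.split("\n")[0] for row in unique_rows]
print(titles)  # ['Réalité augmentée']
```

Because the whole row (title + artist + album + duration) is used as the dedup key, two different tracks that happen to share a title are no longer collapsed into one.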