Following up on my previous question, I have gotten some small parts of my task working.
This is what I have put together so far:
import os
from collections import namedtuple
from operator import itemgetter
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
song = namedtuple('song', 'artist title album duration artistlink songlink albumlink')
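# Build the profiles directory path without relying on unescaped backslashes
path = os.path.join(os.environ['APPDATA'], 'Mozilla', 'Firefox', 'Profiles')
# Use the first profile directory found, with forward slashes
path = os.path.join(path, os.listdir(path)[0]).replace('\\', '/')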
profile = webdriver.FirefoxProfile(path)
Firefox = webdriver.Firefox(profile)
wait = WebDriverWait(Firefox, 30)
Firefox.get('https://music.163.com/#/playlist?id=158624364&userid=126762751')
iframe = Firefox.find_element_by_xpath('//iframe[@id="g_iframe"]')
Firefox.switch_to.frame(iframe)
wait.until(EC.visibility_of_element_located((By.XPATH, '//table/tbody/tr')))
rows = Firefox.find_elements_by_xpath('//table/tbody/tr')
entries = []
for row in rows:
    column1 = row.find_element_by_xpath('td[2]/div/div/div/span/a')
    title = column1.text
    songlink = column1.get_attribute('href')
    duration = row.find_element_by_xpath('td[3]/span').text
    column3 = row.find_element_by_xpath('td[4]/div/span/a')
    artist = column3.text
    artistlink = column3.get_attribute('href')
    column4 = row.find_element_by_xpath('td[5]/div/a')
    album = column4.text
    albumlink = column4.get_attribute('href')
    entries.append(song(artist, title, album, duration, artistlink, songlink, albumlink))
The wait is a must because the JavaScript takes some time to load all the entries; if the table is scraped too early, it contains at most 1000 songs.
I am concerned about the loop: it takes more than three minutes to process just 2748 entries.
This line:
rows = Firefox.find_elements_by_xpath('//table/tbody/tr')
It gets the entire table pretty fast (under three seconds), but I don't know why calling find_element_by_xpath() and get_attribute() repeatedly in a loop makes the code run so slowly.
Is calling these methods so many times in a short period too taxing for the browser, or is creating named tuples inherently slow?
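A quick way to check the named tuple half of that question would be to time the tuple creation on its own, with no browser involved. This is a minimal sketch that uses dummy strings in place of the scraped values; build_entries is a hypothetical helper, not part of my script.

```python
import timeit
from collections import namedtuple

song = namedtuple('song', 'artist title album duration artistlink songlink albumlink')

def build_entries(n):
    # Build n named tuples from dummy strings; no Selenium calls are made here.
    return [song('artist', 'title', 'album', '03:30', 'a', 's', 'b') for _ in range(n)]

# Time a single pass that creates as many tuples as there are table rows.
elapsed = timeit.timeit(lambda: build_entries(2748), number=1)
print(f'{elapsed:.4f} s to create 2748 named tuples')
```

If the printed time is negligible next to three minutes, the named tuple can be ruled out and the time must be going into the Selenium calls.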
How can it be optimized?
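One direction that might help, assuming the cost comes from every find_element_by_xpath(), .text and get_attribute() call being a separate round trip between Python and the browser, is to collect all the cells in a single execute_script() call. This is only a sketch that I have not run against the real page: EXTRACT_JS and scrape_entries are hypothetical names, the CSS selectors are my translation of the XPaths above and assume every row has the same layout, and textContent may not match Selenium's .text when a cell contains hidden text.

```python
from collections import namedtuple

song = namedtuple('song', 'artist title album duration artistlink songlink albumlink')

# JavaScript executed inside the page. It returns one array per table row,
# with fields in the same order as the named tuple. A row that lacks one of
# the expected elements would make the script throw (assumed not to happen).
EXTRACT_JS = """
return Array.from(document.querySelectorAll('table tbody tr')).map(function (tr) {
    var title = tr.querySelector('td:nth-of-type(2) > div > div > div > span > a');
    var duration = tr.querySelector('td:nth-of-type(3) > span');
    var artist = tr.querySelector('td:nth-of-type(4) > div > span > a');
    var album = tr.querySelector('td:nth-of-type(5) > div > a');
    return [
        artist.textContent.trim(), title.textContent.trim(),
        album.textContent.trim(), duration.textContent.trim(),
        artist.href, title.href, album.href
    ];
});
"""

def scrape_entries(driver):
    # One round trip: the JavaScript array of arrays comes back as a list of lists.
    rows = driver.execute_script(EXTRACT_JS)
    return [song(*row) for row in rows]
```

In my script this would replace the loop with entries = scrape_entries(Firefox), called after switching to the iframe and after the wait.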