
Following up on my previous question, I have succeeded with some small parts of my task.

This is what I have put together so far:

import os
from collections import namedtuple
from operator import itemgetter
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

song = namedtuple('song', 'artist title album duration artistlink songlink albumlink')

path = os.environ['APPDATA'] + r'\Mozilla\Firefox\Profiles'  # raw string so the backslashes stay literal
path = (path + '\\' + os.listdir(path)[0]).replace('\\', '/')  # first profile folder, with forward slashes
profile = webdriver.FirefoxProfile(path)

Firefox = webdriver.Firefox(profile)
wait = WebDriverWait(Firefox, 30)

Firefox.get('https://music.163.com/#/playlist?id=158624364&userid=126762751')

iframe = Firefox.find_element_by_xpath('//iframe[@id="g_iframe"]')
Firefox.switch_to.frame(iframe)

wait.until(EC.visibility_of_element_located((By.XPATH, '//table/tbody/tr')))

rows = Firefox.find_elements_by_xpath('//table/tbody/tr')

entries = []

for row in rows:
    column1 = row.find_element_by_xpath('td[2]/div/div/div/span/a')
    title = column1.text
    songlink = column1.get_attribute('href')
    duration = row.find_element_by_xpath('td[3]/span').text
    column3 = row.find_element_by_xpath('td[4]/div/span/a')
    artist = column3.text
    artistlink = column3.get_attribute('href')
    column4 = row.find_element_by_xpath('td[5]/div/a')
    album = column4.text
    albumlink = column4.get_attribute('href')
    entries.append(song(artist, title, album, duration, artistlink, songlink, albumlink))

The wait is a must because the JavaScript takes some time to load all those entries; if the table is scraped too early, there will be at most 1000 songs.
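(As a side note, the wait can also be tied to the row count itself rather than just to the first row. The following is only a sketch, not part of my script as it stands, and the 1000 threshold is purely illustrative; the idea is to keep waiting until the userscript has expanded the playlist past the default cap.)

# sketch: keep polling until the table holds more than the default 1000-row cap
wait.until(lambda d: len(d.find_elements_by_xpath('//table/tbody/tr')) > 1000)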

I am concerned about the loop: it takes more than three minutes to process just 2748 entries.

This line:

rows = Firefox.find_elements_by_xpath('//table/tbody/tr')

It gets the entire table pretty fast (under three seconds), but I don't know why calling find_element_by_xpath() and get_attribute() many times in a loop makes the code run so slowly.

Is calling these methods this many times in a short period too taxing for the browser, or is creating a named tuple inherently slow?

How can it be optimized?
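(For context on where the time goes: every find_element_by_xpath(), .text, and get_attribute() call is a separate round trip between the Python client and geckodriver, and the loop above issues roughly ten of them per row, i.e. tens of thousands for 2748 rows; building the namedtuple itself is negligible by comparison. One common way around this is to collect all fields in a single execute_script() call. The sketch below is only an illustration: the CSS selectors are my translation of the XPaths above and are assumptions about the page structure, not something I have verified.)

# sketch: gather every field in one round trip by running JavaScript inside the
# (already switched-to) iframe, instead of many WebDriver calls per row
script = """
return Array.from(document.querySelectorAll('table tbody tr')).map(function (row) {
    var title  = row.querySelector('td:nth-child(2) > div > div > div > span > a');
    var dur    = row.querySelector('td:nth-child(3) > span');
    var artist = row.querySelector('td:nth-child(4) > div > span > a');
    var album  = row.querySelector('td:nth-child(5) > div > a');
    return {
        title:      title  ? title.textContent  : '',
        songlink:   title  ? title.href         : '',
        duration:   dur    ? dur.textContent    : '',
        artist:     artist ? artist.textContent : '',
        artistlink: artist ? artist.href        : '',
        album:      album  ? album.textContent  : '',
        albumlink:  album  ? album.href         : ''
    };
});
"""

records = Firefox.execute_script(script)  # one round trip for the whole table
entries = [song(r['artist'], r['title'], r['album'], r['duration'],
                r['artistlink'], r['songlink'], r['albumlink'])
           for r in records]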

  • `//table/tbody/tr` just shows 6 entries to me. How is it fetching more than 2K items for you? – cruisepandey Jun 07 '21 at 09:39
  • @cruisepandey As I wrote in the question body, you need to install this https://greasyfork.org/en/scripts/406054-%E7%BD%91%E6%98%93%E4%BA%91%E9%9F%B3%E4%B9%90%E6%98%BE%E7%A4%BA%E5%AE%8C%E6%95%B4%E6%AD%8C%E5%8D%95 in Tampermonkey to lift the restriction. –  Jun 07 '21 at 09:44

1 Answer


It's not about your code's speed but rather about its correctness.
Inside the for loop you are trying to search within each specific row, but I'm not sure you are getting what you intended.
When searching for a sub-element inside some parent node element, you should start your XPath expression with ., meaning "from here", i.e. starting from this node element. Otherwise a relative XPath like td[2]/div/div/div/span/a will be evaluated against the entire web page.
Here you can see this explanation.
Please try this and tell me if it makes a difference:

for row in rows:
    column1 = row.find_element_by_xpath('.//td[2]/div/div/div/span/a')
    title = column1.text
    songlink = column1.get_attribute('href')
    duration = row.find_element_by_xpath('.//td[3]/span').text
    column3 = row.find_element_by_xpath('.//td[4]/div/span/a')
    artist = column3.text
    artistlink = column3.get_attribute('href')
    column4 = row.find_element_by_xpath('.//td[5]/div/a')
    album = column4.text
    albumlink = column4.get_attribute('href')
    entries.append(song(artist, title, album, duration, artistlink, songlink, albumlink))
Prophet
  • Thank you for your answer, but it doesn't address the performance issue. The more correct XPaths are appreciated, but I did get what I wanted with my XPaths. –  Jun 07 '21 at 12:50
  • OK, I just tried to help with what I know. As for the performance, I currently have no idea, sorry. – Prophet Jun 07 '21 at 12:53