Python selenium get contents of a webpage added by javascript

Question

I use an online music player called "Netease Cloud Music" and have multiple playlists in my account. They hold thousands of tracks, are very poorly organized and categorized, and contain duplicate entries, so I want to export them into an SQL table to organize them.

I have found a way to view the playlists without using the client software: click the share button at the top of the playlist page, then click "copy link".

But when the link is opened in a browser rather than the client, the playlist is limited to 1000 tracks.

I found a way around that: I installed Tampermonkey and then installed this script.

Now I can view full playlists in a browser.

This is a sample playlist.

The playlists look like this:

The first column holds the song title, the second column holds the duration, the third column holds the artist, and the last column holds the album.

The text in the first, third and fourth columns is a hyperlink to the song, artist and album page respectively.

I don't know a thing about HTML, but I managed to work out its data structure.

The thing we need is the table located at XPath //table/tbody; each row is a child node of it named tr (XPath //table/tbody/tr).

This is a sample row:

<td class="left">
    <div class="hd "><span data-res-id="5221710" data-res-type="18" data-res-action="play" data-res-from="13" data-res-data="158624364" class="ply ">&nbsp;</span><span class="num">1</span></div>
</td>
<td>
    <div class="f-cb">
        <div class="tt">
            <div class="ttc">
                <span class="txt">
                    <a href="#/song?id=5221710"><b title="Axel F">Axel F</b></a>
                    
                    
                </span>
            </div>
        </div>
    </div>
</td>
<td class=" s-fc3">
    <span class="u-dur candel">03:00</span>
    <div class="opt hshow">
        <a class="u-icn u-icn-81 icn-add" href="javascript:;" title="添加到播放列表" hidefocus="true" data-res-type="18" data-res-id="5221710" data-res-action="addto" data-res-from="13" data-res-data="158624364"></a>
        <span data-res-id="5221710" data-res-type="18" data-res-action="fav" class="icn icn-fav" title="收藏"></span>
        <span data-res-id="5221710" data-res-type="18" data-res-action="share" data-res-name="Greatest Hits Of The Millennium 80's Vol.2" data-res-author="Harold Faltermeyer" data-res-pic="https://p2.music.126.net/tOa6Tizqy755OZE7ITsw_g==/775155697626111.jpg" class="icn icn-share" title="分享">分享</span>
        <span data-res-id="5221710" data-res-type="18" data-res-action="download" class="icn icn-dl" title="下载"></span>
        <span data-res-id="5221710" data-res-type="18" data-res-from="13" data-res-data="158624364" data-res-action="delete" class="icn icn-del" title="删除">删除</span>
    </div>
</td>
<td>
    <div class="text" title="Harold Faltermeyer">
        <span title="Harold Faltermeyer">
            <a href="#/artist?id=34854" hidefocus="true">Harold Faltermeyer</a>
        </span>
    </div>
</td>
<td>
    <div class="text">
        <a href="#/album?id=509819" title="Greatest Hits Of The Millennium 80's Vol.2">Greatest Hits Of The Millennium 80's Vol.2</a>
    </div>
</td>

The columns (td) are child nodes of the tr element.

I have managed to get the XPaths of the columns, relative to each row:

/td[2]/div/div/div/span/a/b -->  title
/td[2]/div/div/div/span/a/@href -->  song link
/td[3]/span -->  duration
/td[4]/div/span/a -->  artist
/td[4]/div/span/a/@href -->  artist link
/td[5]/div/a -->  album
/td[5]/div/a/@href -->  album link

The hrefs only hold the part after the site address (for example #/song?id=5221710), so music.163.com/ has to be added in front of them to get full addresses.

I was thinking about using selenium to get the elements, more specifically find the rows by xpath, then loop through the rows and get the columns by their xpaths inside the rows, then add the values to a list of namedtuples.

From here it is trivial to add the elements to an SQL table.
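As a rough illustration of that last step, the sketch below stores a list of namedtuples in a table. It assumes SQLite through Python's built-in sqlite3 module (the question only says "an SQL table"); the names Track, save_tracks and the table layout are made up for the example. Using the song link as the primary key with INSERT OR IGNORE is one possible way to drop the duplicate entries mentioned above.

```python
import sqlite3
from collections import namedtuple

# Hypothetical record type: one field per column listed in the question.
Track = namedtuple(
    "Track",
    ["title", "song_link", "duration", "artist", "artist_link", "album", "album_link"],
)

BASE = "https://music.163.com/"

# The sample row from the question, entered twice to simulate a duplicate.
sample = Track(
    "Axel F", BASE + "#/song?id=5221710", "03:00",
    "Harold Faltermeyer", BASE + "#/artist?id=34854",
    "Greatest Hits Of The Millennium 80's Vol.2", BASE + "#/album?id=509819",
)
tracks = [sample, sample]


def save_tracks(conn, tracks):
    """Create the table if needed and insert the tracks, skipping duplicates."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tracks (
               title TEXT, song_link TEXT PRIMARY KEY, duration TEXT,
               artist TEXT, artist_link TEXT, album TEXT, album_link TEXT)"""
    )
    # A namedtuple is a plain tuple, so it can be passed as the parameter row.
    conn.executemany(
        "INSERT OR IGNORE INTO tracks VALUES (?, ?, ?, ?, ?, ?, ?)", tracks
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # use a file path to keep the data
save_tracks(conn, tracks)
stored = conn.execute("SELECT COUNT(*) FROM tracks").fetchone()[0]
print(stored)
```

Two identical rows go in and one row is stored, because the second insert hits the primary key and is ignored.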

But I just can't get it to work.

I have managed to open a Firefox Selenium window and install Tampermonkey and the script to access the full playlist (these two installations were done manually). Then I went to the playlist page and tried to get the elements:

from selenium import webdriver
Firefox = webdriver.Firefox()
Firefox.get('https://music.163.com/#/playlist?id=158624364&userid=126762751')
Firefox.find_elements_by_xpath('//table/tbody/tr')

The result is an empty list.

I don't know what went wrong. I can view the table element in developer tools just fine, but when I viewed the page source I realized that the table isn't in it.

I have even managed to obtain the full table with developer tools, and I uploaded it here.

But it is invisible to Selenium. Apparently browsers can display content that is not in the original HTML source, and my Selenium code can't see it. That's when I realized that browsers execute JavaScript and the additional content is probably added by a script somewhere, while the code I used doesn't involve JavaScript and only gets the original source without that content.

I tried Googling python selenium get contents of a webpage added by javascript, but it isn't helping.

So I have two questions. First, in the short term, how can I use an HTML parsing library to parse a piece of HTML stored locally in a txt file?

Second, in the long term, how can I use Selenium or any other Python library to get the complete source of a webpage, including the content added by JavaScript, so that I don't need to export the elements manually every time?
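For the first question, one option that needs no extra installation is Python's built-in html.parser module. The sketch below is only an illustration: PlaylistParser and parse_playlist are hypothetical names, and it assumes every row looks like the sample row above (one song link, one artist link, one album link, and a span with class u-dur for the duration). SAMPLE is a shortened copy of that row.

```python
from html.parser import HTMLParser

BASE = "https://music.163.com/"
# href prefix -> field name, based on the links in the sample row
LINK_FIELDS = {"#/song?": "title", "#/artist?": "artist", "#/album?": "album"}


class PlaylistParser(HTMLParser):
    """Collect one dict per <tr>: title, duration, artist, album and links."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.row = None    # dict for the <tr> being read, None outside a row
        self.field = None  # field whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            self.row = {}
        elif self.row is None:
            return
        elif tag == "a":
            href = attrs.get("href") or ""
            for prefix, name in LINK_FIELDS.items():
                if href.startswith(prefix):
                    self.field = name
                    self.row[name] = ""
                    self.row[name + "_link"] = BASE + href
        elif tag == "span" and "u-dur" in (attrs.get("class") or "").split():
            self.field = "duration"
            self.row["duration"] = ""

    def handle_data(self, data):
        if self.row is not None and self.field is not None:
            self.row[self.field] += data

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self.field = None
        elif tag == "tr" and self.row is not None:
            if "title" in self.row:  # ignore rows without a song link
                self.rows.append({k: v.strip() for k, v in self.row.items()})
            self.row = None


def parse_playlist(html):
    """Return a list of dicts, one per playlist row found in the HTML text."""
    parser = PlaylistParser()
    parser.feed(html)
    parser.close()
    return parser.rows


SAMPLE = """
<table><tbody><tr>
<td class="left"><div class="hd "><span class="ply ">&nbsp;</span><span class="num">1</span></div></td>
<td><div class="f-cb"><div class="tt"><div class="ttc"><span class="txt">
  <a href="#/song?id=5221710"><b title="Axel F">Axel F</b></a>
</span></div></div></div></td>
<td class=" s-fc3"><span class="u-dur candel">03:00</span></td>
<td><div class="text"><span><a href="#/artist?id=34854">Harold Faltermeyer</a></span></div></td>
<td><div class="text"><a href="#/album?id=509819">Greatest Hits Of The Millennium 80's Vol.2</a></div></td>
</tr></tbody></table>
"""

tracks = parse_playlist(SAMPLE)
print(tracks[0]["title"], tracks[0]["duration"], tracks[0]["artist"])
```

To parse the saved table instead of SAMPLE, read the txt file into a string first (for example with open(path, encoding="utf-8").read(), where path is wherever the file was saved) and pass that string to parse_playlist.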

Selenium can display elements loaded by JavaScript. Maybe you are not waiting long enough for the table to load. Have you tried adding some delay before scraping the table? — Roy, Jun 07 '21 at 06:06
I checked the link given here. There are two problems. First, without logging in, the page gives only 6 songs, so even if we can get the table details we will only get 6 songs; we need to log in first. When you check from your own browser you are already logged in, which is why it gives you the full playlist. Second, the table is inside an iframe, so you first need to switch to the iframe and then scrape the data. — Roy, Jun 07 '21 at 06:27

Prophet · Accepted Answer · 2021-06-07T07:39:55.270

The simplest answer is that you have to add some delay after opening the page with Firefox.get('https://music.163.com/#/playlist?id=158624364&userid=126762751') and before getting the elements with Firefox.find_elements_by_xpath('//table/tbody/tr'), to let the elements on the page load. That takes a few moments.
So you can simply add something like time.sleep(5) there.
The better approach is to use expected conditions instead.
Something like this:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

Firefox = webdriver.Firefox()

# Explicit wait that polls for up to 20 seconds
wait = WebDriverWait(Firefox, 20)

Firefox.get('https://music.163.com/#/playlist?id=158624364&userid=126762751')

# Block until the first row is visible instead of sleeping for a fixed time
wait.until(EC.visibility_of_element_located((By.XPATH, '//table/tbody/tr')))

# find_elements(By.XPATH, ...) replaces find_elements_by_xpath, which Selenium 4.3 removed
rows = Firefox.find_elements(By.XPATH, '//table/tbody/tr')
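
To make the difference between a fixed time.sleep(5) and an explicit wait concrete, here is a minimal stand-in for the kind of polling loop that WebDriverWait performs. It is a sketch, not Selenium's implementation: wait_until, rows_loaded and the fake page state are hypothetical, and the 0.5 second default only mirrors WebDriverWait's default poll frequency.

```python
import time


def wait_until(condition, timeout=20.0, poll=0.5):
    """Call condition() until it returns something truthy or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # stop as soon as the condition holds
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


# Fake "page": the rows only show up on the third check.
attempts = {"count": 0}


def rows_loaded():
    attempts["count"] += 1
    return ["<tr>row 1</tr>"] if attempts["count"] >= 3 else []


rows = wait_until(rows_loaded, timeout=2, poll=0.01)
print(rows, attempts["count"])
```

Unlike a fixed sleep, the loop returns as soon as the rows exist and only gives up with an error once the timeout has passed.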

UPD
The table is inside an iframe, so you need to switch to that iframe first, as follows:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

Firefox = webdriver.Firefox()

# Explicit wait that polls for up to 20 seconds
wait = WebDriverWait(Firefox, 20)

Firefox.get('https://music.163.com/#/playlist?id=158624364&userid=126762751')

# The table lives inside this iframe, so switch into it before searching
iframe = Firefox.find_element(By.XPATH, '//iframe[@id="g_iframe"]')
Firefox.switch_to.frame(iframe)

wait.until(EC.visibility_of_element_located((By.XPATH, '//table/tbody/tr')))

# find_elements(By.XPATH, ...) replaces find_elements_by_xpath, which Selenium 4.3 removed
rows = Firefox.find_elements(By.XPATH, '//table/tbody/tr')
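
Once the rows are found, the column XPaths from the question can be applied relative to each row. The following is a sketch under several assumptions: row_to_track, full_link and scrape_playlist are hypothetical helpers, each row is assumed to match the sample row (only the first artist link is read), and a freshly started Firefox has neither the login nor the Tampermonkey script, so it will see fewer tracks than the manually prepared browser. Because get_attribute("href") may return an already resolved address, full_link keeps only the part after the # and rebuilds the link from it.

```python
from urllib.parse import urlsplit

XPATH = "xpath"  # the same string as selenium's By.XPATH
BASE = "https://music.163.com/#"

# Column XPaths from the question, written relative to one <tr>.
SONG = "./td[2]/div/div/div/span/a"
DURATION = "./td[3]/span"
ARTIST = "./td[4]/div/span/a"
ALBUM = "./td[5]/div/a"


def full_link(href):
    """Rebuild a full address from the '#/...' part of an href."""
    return BASE + urlsplit(href).fragment


def row_to_track(row):
    """Turn one <tr> WebElement into a dict of column values."""
    song = row.find_element(XPATH, SONG)
    artist = row.find_element(XPATH, ARTIST)
    album = row.find_element(XPATH, ALBUM)
    return {
        "title": song.text,
        "song_link": full_link(song.get_attribute("href")),
        "duration": row.find_element(XPATH, DURATION).text,
        "artist": artist.text,
        "artist_link": full_link(artist.get_attribute("href")),
        "album": album.text,
        "album_link": full_link(album.get_attribute("href")),
    }


def scrape_playlist(url, timeout=20):
    """Open the playlist page, enter the iframe and read every row."""
    # Imported here so the helpers above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        wait = WebDriverWait(driver, timeout)
        # Wait for the iframe and switch into it in one step.
        wait.until(EC.frame_to_be_available_and_switch_to_it((By.ID, "g_iframe")))
        rows = wait.until(
            EC.presence_of_all_elements_located((By.XPATH, "//table/tbody/tr"))
        )
        return [row_to_track(row) for row in rows]
    finally:
        driver.quit()
```

Calling scrape_playlist('https://music.163.com/#/playlist?id=158624364&userid=126762751') would then return one dict per row that the page shows to that browser session.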

This should work now. If not, add some delay before the line that locates the iframe (`//iframe[@id="g_iframe"]`), but normally it should work without it. — Prophet, Jun 07 '21 at 06:34
Just a little FYI: executing the line `wait = WebDriverWait(browser, 20)` raises this error: `NameError: name 'browser' is not defined`. I know it should be Firefox and I believe it was mistyped, but please correct it in case other people see the code and don't know how to fix that line. — , Jun 07 '21 at 07:28
BTW it's a convention to name the webdriver instance "driver". Think about a situation where you need your automation to work with Chrome too. You would switch simply with `Firefox = webdriver.Chrome()`, but the instance would still be called Firefox throughout the code... It's also a convention to start object instance names with a lowercase letter :) — Prophet, Jun 07 '21 at 07:45
