I use an online music player called "Netease Cloud Music". My account has multiple playlists holding thousands of tracks; they are poorly organized and categorized and contain duplicate entries, so I want to export them into an SQL table to organize them.
I have found a way to view the playlists without the client software: click the share button at the top of the playlist page, then click "copy link".
But when the link is opened in a browser rather than in the client, the playlist is limited to 1000 tracks.
I found a way around this: I installed Tampermonkey and then installed this script.
Now I can view full playlists in a browser.
This is a sample playlist.
The playlists look like this:
The first column holds the song title, the second the duration, the third the artist, and the last the album.
The text in the first, third and fourth columns is hyperlinked to the song, artist and album pages respectively.
I don't know a thing about HTML, but I managed to work out its data structure.
The thing we need is the table located at XPath //table/tbody; each row is a child node of it named tr (XPath //table/tbody/tr).
This is a sample row:
<td class="left">
<div class="hd "><span data-res-id="5221710" data-res-type="18" data-res-action="play" data-res-from="13" data-res-data="158624364" class="ply "> </span><span class="num">1</span></div>
</td>
<td>
<div class="f-cb">
<div class="tt">
<div class="ttc">
<span class="txt">
<a href="#/song?id=5221710"><b title="Axel F">Axel F</b></a>
</span>
</div>
</div>
</div>
</td>
<td class=" s-fc3">
<span class="u-dur candel">03:00</span>
<div class="opt hshow">
<a class="u-icn u-icn-81 icn-add" href="javascript:;" title="添加到播放列表" hidefocus="true" data-res-type="18" data-res-id="5221710" data-res-action="addto" data-res-from="13" data-res-data="158624364"></a>
<span data-res-id="5221710" data-res-type="18" data-res-action="fav" class="icn icn-fav" title="收藏"></span>
<span data-res-id="5221710" data-res-type="18" data-res-action="share" data-res-name="Greatest Hits Of The Millennium 80's Vol.2" data-res-author="Harold Faltermeyer" data-res-pic="https://p2.music.126.net/tOa6Tizqy755OZE7ITsw_g==/775155697626111.jpg" class="icn icn-share" title="分享">分享</span>
<span data-res-id="5221710" data-res-type="18" data-res-action="download" class="icn icn-dl" title="下载"></span>
<span data-res-id="5221710" data-res-type="18" data-res-from="13" data-res-data="158624364" data-res-action="delete" class="icn icn-del" title="删除">删除</span>
</div>
</td>
<td>
<div class="text" title="Harold Faltermeyer">
<span title="Harold Faltermeyer">
<a href="#/artist?id=34854" hidefocus="true">Harold Faltermeyer</a>
</span>
</div>
</td>
<td>
<div class="text">
<a href="#/album?id=509819" title="Greatest Hits Of The Millennium 80's Vol.2">Greatest Hits Of The Millennium 80's Vol.2</a>
</div>
</td>
The columns are the td child nodes of the tr element.
I have managed to get the XPaths corresponding to the columns, relative to each row:
./td[2]/div/div/div/span/a/b --> title
./td[2]/div/div/div/span/a/@href --> song link
./td[3]/span --> duration
./td[4]/div/span/a --> artist
./td[4]/div/span/a/@href --> artist link
./td[5]/div/a --> album
./td[5]/div/a/@href --> album link
We should add the address music.163.com/ in front of the links to get full addresses.
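The column paths above can be tried offline on the sample row. The following is only a minimal sketch, assuming the markup is well-formed enough for Python's standard xml.etree.ElementTree (a real exported page may not be, for example if it contains entities such as &nbsp;). The names Track, BASE, SAMPLE and parse_rows are made up for illustration, and the sample is trimmed and wrapped in table/tbody/tr.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

BASE = "https://music.163.com/"  # prefix for the relative "#/..." links

Track = namedtuple(
    "Track", "title song_link duration artist artist_link album album_link"
)

# Trimmed copy of the sample row, wrapped in table/tbody/tr (an assumption
# about how the exported table looks).
SAMPLE = """
<table><tbody><tr>
<td class="left"><div class="hd "><span class="num">1</span></div></td>
<td><div class="f-cb"><div class="tt"><div class="ttc"><span class="txt">
<a href="#/song?id=5221710"><b title="Axel F">Axel F</b></a>
</span></div></div></div></td>
<td class=" s-fc3"><span class="u-dur candel">03:00</span></td>
<td><div class="text" title="Harold Faltermeyer"><span title="Harold Faltermeyer">
<a href="#/artist?id=34854" hidefocus="true">Harold Faltermeyer</a>
</span></div></td>
<td><div class="text">
<a href="#/album?id=509819" title="Greatest Hits Of The Millennium 80's Vol.2">Greatest Hits Of The Millennium 80's Vol.2</a>
</div></td>
</tr></tbody></table>
"""


def parse_rows(markup):
    """Return one Track per <tr>, using the row-relative paths from the post."""
    table = ET.fromstring(markup.strip())
    tracks = []
    for tr in table.findall("./tbody/tr"):
        song = tr.find("./td[2]/div/div/div/span/a")
        artist = tr.find("./td[4]/div/span/a")
        album = tr.find("./td[5]/div/a")
        tracks.append(
            Track(
                title=song.find("b").text,
                # ElementTree has no /@href step, so read the attribute directly.
                song_link=BASE + song.get("href"),
                duration=tr.find("./td[3]/span").text,
                artist=artist.text,
                artist_link=BASE + artist.get("href"),
                album=album.text,
                album_link=BASE + album.get("href"),
            )
        )
    return tracks


for track in parse_rows(SAMPLE):
    print(track)
```

This sketch reads only the first artist link in a row, so tracks with several artists would need extra handling.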
I was thinking about using Selenium to get the elements: find the rows by XPath, loop through the rows, get the columns by their XPaths within each row, and then add the values to a list of namedtuples.
From here it is trivial to add the elements to an SQL table.
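To make that last step concrete, here is a rough sketch using the standard sqlite3 module. The table layout, the Track field names and the save_tracks helper are assumptions for illustration only; using the song link as primary key together with INSERT OR IGNORE is one possible way to drop the duplicate entries.

```python
import sqlite3
from collections import namedtuple

Track = namedtuple(
    "Track", "title song_link duration artist artist_link album album_link"
)


def save_tracks(conn, tracks):
    """Insert tracks, silently skipping any song_link that is already stored."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            """CREATE TABLE IF NOT EXISTS tracks (
                   title TEXT,
                   song_link TEXT PRIMARY KEY,
                   duration TEXT,
                   artist TEXT,
                   artist_link TEXT,
                   album TEXT,
                   album_link TEXT
               )"""
        )
        # Column order matches the namedtuple field order.
        conn.executemany(
            "INSERT OR IGNORE INTO tracks VALUES (?, ?, ?, ?, ?, ?, ?)",
            tracks,
        )


conn = sqlite3.connect(":memory:")  # use a file path to keep the data
axel_f = Track(
    "Axel F",
    "https://music.163.com/#/song?id=5221710",
    "03:00",
    "Harold Faltermeyer",
    "https://music.163.com/#/artist?id=34854",
    "Greatest Hits Of The Millennium 80's Vol.2",
    "https://music.163.com/#/album?id=509819",
)
save_tracks(conn, [axel_f, axel_f])  # the second copy is ignored
```

The same song appearing in several playlists would also be collapsed into one row here; keeping a separate playlist column or table would be needed if that information matters.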
But I just can't get it to work.
I have managed to open a Firefox Selenium window, install Tampermonkey and the script that gives access to the full playlist (both installed manually), navigate to the playlist page, and then try to get the elements:
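from selenium import webdriver
from selenium.webdriver.common.by import By

Firefox = webdriver.Firefox()
Firefox.get('https://music.163.com/#/playlist?id=158624364&userid=126762751')
# find_elements_by_xpath is gone from newer Selenium 4 releases; this is the equivalent call
Firefox.find_elements(By.XPATH, '//table/tbody/tr')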
The result is an empty list.
I don't know what went wrong. I can view the table element in the developer tools just fine, but when I viewed the page source I realized that the table isn't in it.
I have even managed to obtain the full table with developer tools, and I uploaded it here.
But it is invisible to Selenium. Apparently browsers have a way to display content that is not in the original HTML source, and Selenium can't see it. That's when I realized that browsers can execute JavaScript and that the additional content is probably added by a script somewhere; the code I used didn't involve JavaScript and can only get the original source without the additional content.
I tried Googling "python selenium get contents of a webpage added by javascript", but it didn't help.
So I have two questions. First, in the short term, how can I use an HTML parsing library to parse a piece of HTML stored locally in a txt file?
Second, in the long term, how can I use Selenium or any other Python HTML library to get the complete source of a webpage, including the content added by JavaScript rather than only the original source, so that I don't need to export the elements manually every time?
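For the first question, one possible starting point is the standard library's html.parser, which tolerates markup that is not strict XML. The sketch below is an assumption-laden illustration, not a definitive solution: the class and function names and the file name playlist.txt are hypothetical, it assumes the saved file still contains the tr tags, and it only collects the song, artist and album links of each row (the last one of each kind if a row has several).

```python
from html.parser import HTMLParser
from pathlib import Path


class PlaylistLinkParser(HTMLParser):
    """Collect the song/artist/album links of every <tr> as (href, text)."""

    KINDS = ("song", "artist", "album")

    def __init__(self):
        super().__init__()
        self.rows = []        # one dict per <tr>: kind -> (href, text)
        self._kind = None     # kind of the <a> currently open, if any
        self._href = ""
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append({})
        elif tag == "a" and self.rows:
            href = dict(attrs).get("href") or ""
            for kind in self.KINDS:
                # Links look like "#/song?id=5221710" in the sample row.
                if href.startswith(f"#/{kind}?id="):
                    self._kind, self._href, self._text = kind, href, []

    def handle_data(self, data):
        if self._kind is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._kind is not None:
            text = "".join(self._text).strip()
            self.rows[-1][self._kind] = (self._href, text)
            self._kind = None


def parse_playlist_html(text):
    """Parse markup held in a string and return the non-empty rows."""
    parser = PlaylistLinkParser()
    parser.feed(text)
    parser.close()
    return [row for row in parser.rows if row]


def parse_playlist_file(path):
    """Parse markup saved in a local text file (UTF-8 is assumed)."""
    return parse_playlist_html(Path(path).read_text(encoding="utf-8"))
```

With the table saved as, say, playlist.txt, calling parse_playlist_file("playlist.txt") would return a list of dicts such as {"song": (href, title), "artist": (...), "album": (...)}, one per row.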