The data you want is loaded through an Ajax call, so it is not in the HTML of the page you requested.
Replace
http://pubs.vmware.com/sddc-mgr-12/index.jsp#com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html
With
http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment
And change the find_all element type to node:
from bs4 import BeautifulSoup
try:
    # Python 3
    import urllib.request as urllib2
except ImportError:
    # Python 2
    import urllib2

# Fetch the table-of-contents XML that the page loads via Ajax
urlFile = urllib2.urlopen("http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment")
urlHtml = urlFile.read()
urlFile.close()

soup = BeautifulSoup(urlHtml, "html.parser")
# Each entry in the left panel is a <node> element with an href attribute
for links in soup.find_all('node'):
    print(links.get('href'))
Which outputs:
../topic/com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html
../topic/com.vmware.vcf.ovdeploy.doc_21/GUID-F2DCF1B2-4EF6-444E-80BA-8F529A6D0725.html
../topic/com.vmware.vcf.admin.doc_211/GUID-D5A44DAA-866D-47C9-B1FB-BF9761F97E36.html
../topic/com.vmware.ICbase/PDF/ic_pdf.html
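The hrefs above are relative. If you need full URLs, a minimal sketch is to resolve them against the tocfragment URL with the standard library. This is Python 3 only, the helper name absolute_topic_url is made up for illustration, and it assumes the server intends the hrefs to be resolved relative to the tocfragment URL:

```python
from urllib.parse import urljoin

# URL the TOC XML was fetched from; relative hrefs are resolved against it
# (assumption: this is the base the server intends).
TOC_URL = "http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment"


def absolute_topic_url(href, base=TOC_URL):
    """Resolve a relative href such as '../topic/...' into a full URL."""
    return urljoin(base, href)


print(absolute_topic_url("../topic/com.vmware.ICbase/PDF/ic_pdf.html"))
# http://pubs.vmware.com/sddc-mgr-12/topic/com.vmware.ICbase/PDF/ic_pdf.html
```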
Note that every time you click an item in the left panel, it fires an Ajax call to populate the list. For example:
http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment?toc=/com.vmware.evosddc.via.doc_211/toc.xml
Take note of this particular URL fragment: com.vmware.evosddc.via.doc_211
- You need to extract that part from each href in the first output to build the request that returns the second output, and so on.
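As a rough sketch of that step, the fragment can be cut out of an href and placed into a URL of the same shape as the Ajax call shown above. The helper names plugin_id and child_toc_url are hypothetical, and the /toc.xml suffix is simply copied from that example URL rather than from any documented API:

```python
# Endpoint taken from the Ajax call above.
TOC_ENDPOINT = "http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment"


def plugin_id(href):
    """Return the first path segment after '../topic/'."""
    prefix = "../topic/"
    if href.startswith(prefix):
        href = href[len(prefix):]
    return href.split("/", 1)[0]


def child_toc_url(href):
    """Build the tocfragment URL for the document an href belongs to.

    Assumption: every document follows the '?toc=/<fragment>/toc.xml'
    shape seen in the example Ajax call.
    """
    return TOC_ENDPOINT + "?toc=/" + plugin_id(href) + "/toc.xml"


href = "../topic/com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html"
print(child_toc_url(href))
```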
Example:
# Reuses urlHtml and urllib2 from the first snippet
soup = BeautifulSoup(urlHtml, "html.parser")
for links in soup.find_all('node'):
    child_url = links.get('href').replace("../topic/", "")
    # Keep only the first path segment, e.g. com.vmware.evosddc.via.doc_211
    child = urllib2.urlopen("http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment?toc=/" + child_url[0:child_url.index("/")])
    print(child.read())
    #print (links.get('href'))
Which outputs:
<?xml version="1.0" encoding="UTF-8"?>
<tree_data>
<node
path="0"
title="VIA User's Guide"
id="/com.vmware.evosddc.via.doc_211/toc.xml"
href="../topic/com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html"
image="toc_closed">
</node>
...
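If you want the fields rather than the raw XML, here is a minimal sketch that reads the node attributes. It swaps BeautifulSoup for the standard library's xml.etree.ElementTree so it runs without third-party packages, and it parses a trimmed copy of the output above (one node, with the closing tree_data tag added so the sample is complete) instead of hitting the server. The function name parse_nodes is made up. Note that the id attribute in this sample matches the toc= value in the Ajax URL shown earlier, so it may be usable directly for the next request, though that is only an observation from this one sample:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response above; a real response has more nodes.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tree_data>
<node
 path="0"
 title="VIA User's Guide"
 id="/com.vmware.evosddc.via.doc_211/toc.xml"
 href="../topic/com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html"
 image="toc_closed">
</node>
</tree_data>"""


def parse_nodes(xml_bytes):
    """Return title, id and href for every <node> element in the XML."""
    root = ET.fromstring(xml_bytes)
    return [
        {"title": n.get("title"), "toc_id": n.get("id"), "href": n.get("href")}
        for n in root.iter("node")
    ]


for node in parse_nodes(SAMPLE):
    print(node["title"], node["toc_id"])
```

With a live response you would pass the bytes returned by child.read() to parse_nodes instead of SAMPLE.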