
I am new to Python and Beautiful Soup. I need to scrape all the links on a page so I can index them in Elasticsearch. I am using the code below to get all the links/sublinks on the info page, but it does not retrieve any.

from bs4 import BeautifulSoup
try:
    import urllib.request as urllib2
except ImportError:
    import urllib2

urlFile = urllib2.urlopen("http://pubs.vmware.com/sddc-mgr-12/index.jsp#com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html")

urlHtml = urlFile.read()
urlFile.close()

soup = BeautifulSoup(urlHtml, "html.parser")
for links in soup.find_all('a'):
    print(links.get('href'))

I am unable to retrieve any links/sublinks; print() gives no output.

Please provide some pointers.

Anand
  • Possible duplicate of [How to get rid of BeautifulSoup user warning?](http://stackoverflow.com/questions/33511544/how-to-get-rid-of-beautifulsoup-user-warning) – Teemu Risikko Feb 28 '17 at 14:25
  • Not about the duplicate, but about how to proceed with this type of URL. I am able to get results with a normal URL like https://www.vmware.com/support/pubs/. I tried soup = BeautifulSoup(urlHtml,"html.parser") but it didn't give any result. – Anand Feb 28 '17 at 14:26

1 Answer


The data you want is loaded via an AJAX call.

Replace

http://pubs.vmware.com/sddc-mgr-12/index.jsp#com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html

With

http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment

And change the find_all element type to node:

from bs4 import BeautifulSoup
try:
    import urllib.request as urllib2
except ImportError:
    import urllib2

urlFile = urllib2.urlopen("http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment")

urlHtml = urlFile.read()
urlFile.close()

soup = BeautifulSoup(urlHtml, "html.parser")
for links in soup.find_all('node'):
    print(links.get('href'))

Which outputs:

../topic/com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html
../topic/com.vmware.vcf.ovdeploy.doc_21/GUID-F2DCF1B2-4EF6-444E-80BA-8F529A6D0725.html
../topic/com.vmware.vcf.admin.doc_211/GUID-D5A44DAA-866D-47C9-B1FB-BF9761F97E36.html
../topic/com.vmware.ICbase/PDF/ic_pdf.html

Please note that every time you click a left-panel item, the page fires an AJAX call to populate that item's list. For example:

http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment?toc=/com.vmware.evosddc.via.doc_211/toc.xml

Take note of the URL fragment com.vmware.evosddc.via.doc_211: you'll need to extract that part from the first response in order to fetch the second, and so on.

Example:

soup = BeautifulSoup(urlHtml, "html.parser")
for links in soup.find_all('node'):
    # Strip the "../topic/" prefix and keep only the doc id
    # (e.g. com.vmware.evosddc.via.doc_211) to build the child tocfragment URL.
    child_url = links.get('href').replace("../topic/", "")
    child = urllib2.urlopen("http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment?toc=/" + child_url[0:child_url.index("/")])
    print(child.read())

Which outputs

<?xml version="1.0" encoding="UTF-8"?>
<tree_data>
<node
      path="0"
      title="VIA User&apos;s Guide"
      id="/com.vmware.evosddc.via.doc_211/toc.xml"
      href="../topic/com.vmware.evosddc.via.doc_211/GUID-71BE2329-4B96-4B18-9FF4-1BC458446DB2.html"
      image="toc_closed">
</node>

...
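To walk all the nested sublinks rather than just one level, you can repeat this pattern recursively. Below is a minimal sketch of that idea; it assumes (based on the output above) that each node's id attribute can be passed back as the toc parameter, and the helper names collect_hrefs and fetch are mine, not part of any API:

```python
from bs4 import BeautifulSoup

BASE = "http://pubs.vmware.com/sddc-mgr-12/advanced/tocfragment"

def fetch(toc_id):
    # Hypothetical helper: download one tocfragment document for a given toc id.
    import urllib.request
    with urllib.request.urlopen(BASE + "?toc=" + toc_id) as f:
        return f.read()

def collect_hrefs(xml, fetch=fetch, seen=None):
    """Recursively gather every node href, following each node's id
    attribute into its own tocfragment request."""
    if seen is None:
        seen = set()          # toc ids already fetched, to avoid loops
    hrefs = []
    soup = BeautifulSoup(xml, "html.parser")
    for node in soup.find_all("node"):
        href = node.get("href")
        if href:
            hrefs.append(href)
        toc_id = node.get("id")
        if toc_id and toc_id not in seen:
            seen.add(toc_id)
            # Descend into this node's own tocfragment.
            hrefs.extend(collect_hrefs(fetch(toc_id), fetch, seen))
    return hrefs
```

Starting it with the top-level document (urlHtml above) should yield the full flattened list of hrefs, which you can then feed to Elasticsearch. The fetch step is kept separate so the traversal can be exercised on any XML string.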
Zroq
  • Thanks for the reply, but in the left panel there are many links and sublinks (child links); I want to get all of those. How do I get them all? – Anand Feb 28 '17 at 14:43
  • I found some approaches, but it seems I have to use something like Selenium to get all the nested sublinks... – Anand Mar 01 '17 at 13:46