
I am trying to scrape product information from the official VMware website using Selenium together with Scrapy. However, the page never loads completely when driven from code, even with a longer wait time. Here is my script:

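import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class VmwareSpiderSpider(scrapy.Spider):

    name = 'vmware_spider'
    allowed_domains = ['customerconnect.vmware.com']
    start_urls = [
        'https://customerconnect.vmware.com/en/downloads/details?downloadGroup=NSX-4011&productId=1339#product_downloads']

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider finish its own setup before starting the browser.
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()

    def parse(self, response):
        # Load the page in Firefox so that its JavaScript runs.
        self.driver.get(response.url)

        # Wait up to 120 s, polling every 5 s, for the first "Read More" link.
        self.driver.implicitly_wait(30)
        wait = WebDriverWait(self.driver, 120, poll_frequency=5)
        wait.until(EC.presence_of_element_located((By.PARTIAL_LINK_TEXT, "Read More")))

        # Save the rendered DOM for inspection.
        with open("source.html", "w", encoding="utf-8") as f:
            f.write(self.driver.page_source)

        self.driver.quit()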

I am not familiar with web page design and architecture, so I have a few questions:

  1. If the page has 20 items that contain "Read More", how can I make sure all 20 are loaded before I start locating elements?
  2. In the original web page, the "Read More" element has an onclick attribute, but in the page source I retrieved with Selenium the attribute is gone, so the click goes nowhere. What causes this?

Any hints will be appreciated. Thanks a lot.

Md. Fazlul Hoque
Q Yang

1 Answer


All the required data is loaded through an API call that returns JSON in response to a GET request. To find it, press F12 to open the browser's developer tools, select the Network tab, reload the page, and filter the requests by XHR. Click a request in the Name column and inspect its Headers and Preview tabs; the Headers tab shows the API URL and the Preview tab shows the JSON it returns.

import json

import scrapy

# JSON endpoint that the product page calls in the background.
API_URL = "https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339"


class VmwareSpiderSpider(scrapy.Spider):
    name = "vm"
    start_urls = [API_URL]

    # Send a browser-like User-Agent with the request.
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }

    def parse(self, response):
        # The response body is JSON, so no HTML parsing or browser is needed.
        json_response = json.loads(response.text)
        datas = json_response["downloadFiles"]
        for data in datas:
            # .get() returns None for entries that lack a field.
            yield {
                "title": data.get("title"),
                "fileName": data.get("fileName"),
                "releaseDate": data.get("releaseDate"),
                "build": data.get("build"),
            }

Output:

{'title': 'NSX Manager/ NSX Global Manager / NSX Cloud Service Manager for VMware ESXi', 'fileName': 'nsx-unified-appliance-4.0.1.1.0.20598732.ova', 'releaseDate': '2022-10-13', 'build': '20598726'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'NSX Manager with vCenter Plugin', 'fileName': 'nsx-embedded-unified-appliance-4.0.1.1.0.20598732.ova', 'releaseDate': '2022-10-13', 'build': '20598726'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'NSX Application Platform', 'fileName': 'VMware-NSX-Application-Platform-4.0.1.0.0.20606727.tgz', 'releaseDate': '2022-10-13', 'build': '20606727'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'Kubernetes-tools 1.21', 'fileName': 'kubernetes-tools-1.21.9-00_3.8.0-1.tar.gz', 'releaseDate': '2022-10-13', 'build': '20596968'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'Kubernetes-tools 1.23', 'fileName': 'kubernetes-tools-1.23.3-00_3.8.0-1.tar.gz', 'releaseDate': '2022-10-13', 'build': '20596968'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'NSX SVM Appliance', 'fileName': 'VMware-NSX-Malware-Prevention-appliance-4.0.1.1.0.20598729.ova', 'releaseDate': '2022-10-13', 'build': '20598729'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'NSX Edge for Bare Metal', 'fileName': 'nsx-edge-4.0.1.1.0.20598735.iso', 'releaseDate': '2022-10-13', 'build': '20598726'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'NSX Edge for VMware ESXi', 'fileName': 'nsx-edge-4.0.1.1.0.20598735.ova', 'releaseDate': '2022-10-13', 'build': '20598726'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'NSX Kernel Module for VMware ESXi 7.0', 'fileName': 'nsx-lcp-4.0.1.1.0.20598730-esx70.zip', 'releaseDate': '2022-10-13', 'build': '20598726'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'NSX Kernel Module for VMware ESXi 8.0', 'fileName': 'nsx-lcp-4.0.1.1.0.20598730-esx80.zip', 'releaseDate': '2022-10-13', 'build': '20598726'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'Standalone Edge - Client', 'fileName': 'nsx-l2vpn-client-ovf-19300606.tar.gz', 'releaseDate': '2022-10-13', 'build': '19307994'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': None, 'fileName': None, 'releaseDate': None, 'build': None}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'NSX 4.0.1.1 Upgrade Bundle', 'fileName': 'VMware-NSX-upgrade-bundle-4.0.1.1.0.20598726.mub', 'releaseDate': '2022-10-13', 'build': '20598726'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'NSX Cloud Upgrade Bundle for NSX-T 4.0.1.1', 'fileName': 'VMware-CC-upgrade-bundle-4.0.1.1.0.20598726.mub', 'releaseDate': '2022-10-13', 'build': '20598726'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'Upgrade bundle for NSX-T L2 VPN Client Appliance', 'fileName': 'VMware-NSX-edge-4.0.1.1.0.20598735.nub', 'releaseDate': '2022-10-13', 'build': '20598726'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': None, 'fileName': None, 'releaseDate': None, 'build': None}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'NSX BM Server Module for RHEL 7.6 / CentOS 7.6 / OEL 7.6', 'fileName': 'nsx-lcp-4.0.1.1.0.20598730-baremetal-server-rhel76_x86_64.tar.gz', 'releaseDate': '2022-10-13', 'build': '20598726'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'NSX BM Server Module for RHEL 7.6 Container', 'fileName': 'nsx-lcp-4.0.1.1.0.20598730-baremetal-container-rhel76_x86_64.tar.gz', 'releaseDate': '2022-10-13', 'build': '20598726'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'NSX BM Server Module for RHEL 7.7 / CentOS 7.7 / OEL 7.7', 'fileName': 'nsx-lcp-4.0.1.1.0.20598730-baremetal-server-rhel77_x86_64.tar.gz', 'releaseDate': '2022-10-13', 'build': '20598726'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'NSX BM Server Module for RHEL 7.7 Container', 'fileName': 'nsx-lcp-4.0.1.1.0.20598730-baremetal-container-rhel77_x86_64.tar.gz', 'releaseDate': '2022-10-13', 'build': '20598726'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'NSX BM Server Module for RHEL 7.8 / CentOS 7.8 / OEL 7.8', 'fileName': 'nsx-lcp-4.0.1.1.0.20598730-baremetal-server-rhel78_x86_64.tar.gz', 'releaseDate': '2022-10-13', 'build': '20598726'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'NSX BM Server Module for RHEL 7.8 Container', 'fileName': 'nsx-lcp-4.0.1.1.0.20598730-baremetal-container-rhel78_x86_64.tar.gz', 'releaseDate': '2022-10-13', 'build': '20598726'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'NSX BM Server Module for RHEL 7.9 / CentOS 7.9 / OEL 7.9', 'fileName': 'nsx-lcp-4.0.1.1.0.20598730-baremetal-server-rhel79_x86_64.tar.gz', 'releaseDate': '2022-10-13', 'build': '20598726'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'NSX BM Server Module for RHEL 8.0 / CentOS 8.0', 'fileName': 'nsx-lcp-4.0.1.1.0.20598730-baremetal-server-rhel80_x86_64.tar.gz', 'releaseDate': '2022-10-13', 'build': '20598726'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'NSX BM Server Module for RHEL 8.3 / CentOS 8.3', 'fileName': 'nsx-lcp-4.0.1.1.0.20598730-baremetal-server-rhel83_x86_64.tar.gz', 'releaseDate': '2022-10-13', 'build': '20598726'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'NSX BM Server Module for SUSE SLES 12sp3', 'fileName': 'nsx-lcp-4.0.1.1.0.20598730-baremetal-server-linux64-sles12sp3.tar.gz', 'releaseDate': '2022-10-13', 'build': '20598726'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'NSX BM Server Module for SUSE SLES 12sp4', 'fileName': 'nsx-lcp-4.0.1.1.0.20598730-baremetal-server-linux64-sles12sp4.tar.gz', 'releaseDate': '2022-10-13', 'build': '20598726'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'NSX BM Server Module for Ubuntu 16.04', 'fileName': 'nsx-lcp-4.0.1.1.0.20598730-baremetal-server-ubuntu-xenial_amd64.tar.gz', 'releaseDate': '2022-10-13', 'build': '20598726'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'NSX BM Server Module for Ubuntu 18.04', 'fileName': 'nsx-lcp-4.0.1.1.0.20598730-baremetal-server-linux64-bionic_amd64.tar.gz', 'releaseDate': '2022-10-13', 'build': '20598726'}
2022-10-31 05:03:00 [scrapy.core.scraper] DEBUG: Scraped from <200 https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details?locale=en_US&downloadGroup=NSX-4011&productId=1339>
{'title': 'NSX BM Server Module for Windows 2016/ 2019', 'fileName': 'nsx-lcp-4.0.1.20598730-baremetal-server-win32_vs2017.zip', 'releaseDate': '2022-10-13', 'build': '20598726'}
2022-10-31 05:03:00 [scrapy.core.engine] INFO: Closing spider (finished)
2022-10-31 05:03:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 389,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 9030,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 1.133655,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 10, 30, 23, 3, 0, 669723),
 'httpcompression/response_bytes': 23588,
 'httpcompression/response_count': 1,
 'item_scraped_count': 30,
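The output above includes two items in which every field is None. A minimal sketch of how such entries could be skipped is shown here; the helper name `iter_download_items` is hypothetical, and the idea that all-None entries are non-file placeholders (for example section headers) is an assumption, not something confirmed by the API.

```python
def iter_download_items(payload):
    """Yield one dict per entry in payload["downloadFiles"], skipping empty ones."""
    fields = ("title", "fileName", "releaseDate", "build")
    for entry in payload.get("downloadFiles", []):
        item = {field: entry.get(field) for field in fields}
        # Assumption: an entry with none of these fields is not a downloadable file.
        if any(value is not None for value in item.values()):
            yield item
```

Inside the spider, `parse` could then be reduced to `yield from iter_download_items(json.loads(response.text))`.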
Md. Fazlul Hoque
  • It is amazing! Could you tell me where to find `API_URL` in your code? – Q Yang Oct 30 '22 at 23:04
  • @Q Yang, thanks. You can find a couple of discussions about locating API requests in the browser here: https://stackoverflow.com/questions/1820927/request-monitoring-in-chrome/3019085#3019085 – Md. Fazlul Hoque Oct 30 '22 at 23:12
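On the `API_URL` question: the API URL in the answer carries the same `downloadGroup` and `productId` query parameters as the product page URL in the question, plus `locale=en_US`. A minimal sketch that derives one from the other follows. The function name `api_url_from_page_url` is hypothetical, and whether other product pages follow the same pattern is an assumption that should be checked in the Network tab.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Endpoint path taken from the API_URL used in the answer.
API_BASE = "https://customerconnect.vmware.com/channel/public/api/v1.0/dlg/details"


def api_url_from_page_url(page_url, locale="en_US"):
    """Build the JSON API URL from a product download page URL."""
    # urlsplit separates the "#product_downloads" fragment from the query string.
    query = parse_qs(urlsplit(page_url).query)
    params = {
        "locale": locale,
        "downloadGroup": query["downloadGroup"][0],
        "productId": query["productId"][0],
    }
    return f"{API_BASE}?{urlencode(params)}"
```

For the page URL from the question, this returns the same string as `API_URL` in the answer, so it could be used to fill `start_urls` for several products at once.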