Please note: this question remains open, because the suggested "answer" still produces the same output. It doesn't explain why JS isn't running on that page or why Selenium can't extract the rendered content.

I'm trying to read the page source of http://147.235.97.36/ (an HP printer), which is rendered by JS.

So I wrote:

driver.get(url)
wait_for_page(driver)
source = driver.page_source
print(source)

but in the printed source I see:

<p>JavaScript is required to access this website.</p>

<p>Please enable JavaScript or use a browser that supports JavaScript.</p>

and some of the content isn't there, so I changed my code to:

driver.get(url)
wait_for_page(driver)
source = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
print(source)

I still get the same output. Can you help me understand what the problem is here?

Here is my init_driver function:

import os

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

# REQUEST_TIMEOUT and log_warning are defined elsewhere in my project.


def init_driver():
    # --Initialize Driver--#
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # Run in Background
    if os.name == 'nt':
        chrome_options.add_argument('--disable-gpu')  # Windows workaround
    prefs = {"profile.default_content_settings.images": 2,
             "profile.managed_default_content_settings.images": 2}  # Disable Loading of Images
    chrome_options.add_experimental_option("prefs", prefs)
    chrome_options.add_argument('--ignore-ssl-errors=yes')
    chrome_options.add_argument('--ignore-certificate-errors')
    chrome_options.add_argument("--window-size=1920,1080")  # Standard Window Size
    # Page load strategy is a WebDriver capability, not a Chrome command-line switch
    chrome_options.page_load_strategy = "normal"
    driver = None
    try:
        driver = webdriver.Chrome(options=chrome_options, service=Service('./chromedriver'))
        driver.set_page_load_timeout(REQUEST_TIMEOUT)
    except Exception as e:
        log_warning(str(e))
    return driver
John
  • Do you guys work together on this scrape HP printer project? https://stackoverflow.com/questions/72514422/how-to-read-js-generated-page-in-python – baduker Jun 08 '22 at 11:31
  • Weird. Anyhow, the answer I gave there also answers your question. – baduker Jun 08 '22 at 12:08
  • Hi, thank you, but I'm using Selenium, and your answer didn't help me understand what I'm doing wrong. From all the posts I've read, this is how to read content generated by JS code – John Jun 08 '22 at 12:18
  • I have waited for the page to load completely so what did I do wrong here? – John Jun 08 '22 at 12:19
  • Oh, I forgot to mention this isn't supposed to work only for HP printers; I'm looking for a general solution. – John Jun 08 '22 at 12:20

1 Answer

You can add a few arguments to avoid getting detected and print the page source as follows:

  • Code Block:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    options.add_argument("start-maximized")
    # One call with both switches: a second "excludeSwitches" call would overwrite the first
    options.add_experimental_option("excludeSwitches", ["enable-automation", "enable-logging"])
    options.add_experimental_option('useAutomationExtension', False)
    options.add_argument('--disable-blink-features=AutomationControlled')
    s = Service('C:\\BrowserDrivers\\chromedriver.exe')
    driver = webdriver.Chrome(service=s, options=options)
    driver.get("http://147.235.97.36/")
    # Serialize the live DOM rather than the original response body
    print(driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML"))
    
  • Console Output:

    <head>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      <meta name="viewport" content="width=device-width, initial-scale=1">
    
      <link href="/framework/Unified.css" rel="stylesheet" type="text/css">
    
      <script type="text/javascript">
      frameWorkObj = {};
      frameWorkObj.pkg = "ews";
      </script>
    
      <script src="/framework/Unified.js" type="text/javascript"></script>
    </head>
    
    <body class="theme-gray">
    <iframe src="/framework/cookie/client/cookie.html" style="display: none;"></iframe>
    
    <div id="pgm-overall-container">
      <div id="pgm-left-pane-bkground"></div>
      <div id="pgm-banner"></div>
      <div id="pgm-search-div" class="gui-hidden"></div>
      <div id="pgm-top-pane"></div>
    
      <div id="pgm-container-div">
        <div id="pgm-left-pane"></div>
        <div id="pgm-container" class="clear-fix">
          <div id="pgm-title-div" class="gui-hidden"></div>
          <div id="contentPane" class="contentPane"></div>
        </div>
      </div>
    
      <div id="pgm-footer"></div>
    </div> <!-- #pgm-overall-container -->
    
    <div id="pgm-theatre-staging-div"></div>
    
    <script type="text/javascript">
    // frame buster
    if(top != self)
      top.location.replace(self.location.href);
    </script>
    
    <noscript>
    <div id="pgm-no-js-text">
    <p>JavaScript is required to access this website.</p>
    
    <p>Please enable JavaScript or use a browser that supports JavaScript.</p>
    </div>
    </noscript>
    
    
    <div id="ui-datepicker-div" style="display: none;" tabindex="0"></div></body>
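A note on reading this output: the "JavaScript is required" text sits inside a <noscript> element. Browsers keep that element in the serialized DOM even when scripting is enabled, so seeing the text in the page source does not by itself mean JS was off. What the output does show is that several placeholder containers are still empty. The following is a minimal sketch, using only the Python standard library, that ignores <noscript> blocks and lists the ids of <div> elements with no children and no text. The name find_empty_divs and the trimmed SAMPLE string are illustrative assumptions, not part of the original answer.

```python
from html.parser import HTMLParser


class _EmptyDivFinder(HTMLParser):
    """Collect ids of <div> elements that have no child elements and no text.

    Anything inside <noscript> is ignored, because that fallback markup is
    present in the source whether or not scripts ran.
    """

    def __init__(self):
        super().__init__()
        self.stack = []          # open <div> entries: [id, has_content]
        self.noscript_depth = 0  # > 0 while inside a <noscript> block
        self.empty_ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self.noscript_depth += 1
            return
        if self.noscript_depth:
            return
        if self.stack:
            self.stack[-1][1] = True  # the enclosing div has a child element
        if tag == "div":
            self.stack.append([dict(attrs).get("id"), False])

    def handle_endtag(self, tag):
        if tag == "noscript":
            self.noscript_depth = max(0, self.noscript_depth - 1)
            return
        if self.noscript_depth:
            return
        if tag == "div" and self.stack:
            div_id, has_content = self.stack.pop()
            if div_id and not has_content:
                self.empty_ids.append(div_id)

    def handle_data(self, data):
        if not self.noscript_depth and self.stack and data.strip():
            self.stack[-1][1] = True  # the enclosing div has visible text


def find_empty_divs(html):
    """Return ids of empty <div> elements outside <noscript>, in closing order."""
    finder = _EmptyDivFinder()
    finder.feed(html)
    finder.close()
    return finder.empty_ids


# Trimmed-down stand-in for the console output above (illustrative only).
SAMPLE = """
<body>
<div id="pgm-banner"></div>
<div id="pgm-container"><div id="contentPane" class="contentPane"></div></div>
<noscript><div id="pgm-no-js-text"><p>JavaScript is required to access this website.</p></div></noscript>
</body>
"""

if __name__ == "__main__":
    print(find_empty_divs(SAMPLE))
```

If ids such as contentPane show up in the result, the source was probably captured before the page's scripts filled those containers, which would be consistent with the richer markup seen in the browser's inspector.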
    
undetected Selenium
  • 1) How is my problem related to getting detected or not (as the output says JS isn't enabled)? 2) Please note you still haven't solved the problem, as the output from inspect element is much richer; for example, it has: 'id="top-cat-Tool"' – John Jun 08 '22 at 21:29
  • This doesn't answer my question... – John Jun 11 '22 at 15:52
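
On the point raised in the comments that inspect element shows richer markup (for example id="top-cat-Tool"): a page finishing its load does not guarantee that its scripts have finished building the DOM, so one common approach is to poll until a known marker appears before reading the source. Below is a minimal, driver-agnostic sketch of that idea. wait_for_marker is a hypothetical helper, and whether this marker lives in the top-level document or inside a frame on this particular printer is not known from the thread. Selenium's own WebDriverWait provides the same kind of polling.

```python
import time


def wait_for_marker(get_source, marker, timeout=10.0, poll=0.5):
    """Call get_source() repeatedly until `marker` appears in the returned HTML.

    get_source: zero-argument callable returning the current HTML as a string,
                e.g. ``lambda: driver.page_source`` with a Selenium driver.
    Returns the HTML that contains the marker, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout
    while True:
        source = get_source()
        if marker in source:
            return source
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{marker!r} not found within {timeout} seconds")
        time.sleep(poll)


if __name__ == "__main__":
    # Stand-in for a page whose content shows up on the third read (illustrative only).
    snapshots = iter([
        '<div id="pgm-left-pane"></div>',
        '<div id="pgm-left-pane"></div>',
        '<div id="pgm-left-pane"><a id="top-cat-Tool"></a></div>',
    ])
    html = wait_for_marker(lambda: next(snapshots), 'id="top-cat-Tool"', timeout=2.0, poll=0.01)
    print(html)
```

With a real driver the call would look like wait_for_marker(lambda: driver.page_source, 'id="top-cat-Tool"'), placed between driver.get(url) and the print.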