I am trying to scrape the phone number from these links: "https://www.practo.com/delhi/doctor/dr-meeka-gulati-dentist-3?specialization=Dentist&practice_id=722421" and "https://www.practo.com/delhi/doctor/dr-rajeev-puri-ear-nose-throat-ent-specialist?specialization=Ear-Nose-Throat%20(ENT)%20Specialist&practice_id=912154"

If the element is present, the script should scrape the phone number; otherwise the phone number should be None.

Spider Code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, TimeoutException

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--window-size=1200,600')  # Chrome expects width,height

# 'chrome_options' is deprecated; pass the options object as 'options'
driver = webdriver.Chrome(options=options)

driver.get('https://www.practo.com/delhi/doctor/dr-meeka-gulati-dentist-3?specialization=Dentist&practice_id=722421')

try:
    # Wait for the first button used below rather than for an unrelated element
    next1 = WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH, '//*[@class="c-btn--light c-btn--center"]'))
    )
    next1.click()

    next2 = driver.find_element(By.XPATH, '//*[@class="u-title-font icon-ic_call_filled u-valign--middle"]')
    next2.click()
    phone_number = driver.find_element(By.CLASS_NAME, 'c-vn__number').get_attribute('innerHTML')
except (NoSuchElementException, TimeoutException):
    # An expected element is missing, so there is no phone number to report
    phone_number = None
finally:
    driver.quit()

print(phone_number)

Output

DevTools listening on ws://127.0.0.1:60482/devtools/browser/9f226a40-2d1a-4108-9fde-f005b49e60b3
[1206/102937.475:INFO:CONSOLE(0)] "[Report Only] Refused to load the script 
'https://www.googletagmanager.com/gtag/js?id=AW-942004674' because it violates the following Content Security Policy directive: "script-src 'self' 'unsafe-inline' 'unsafe-eval' 'strict-dynamic' 'nonce-3RJz12sDPuoV27qS7dcBXLRZawmPobLo' *.practo.com *.practostatic.com *.onesignal.com *.mxpnl.com *.mixpanel.com *.facebook.com *.facebook.net *.twitter.com *.gstatic.com *.googleapis.com *.google.com *.googlesyndication.com *.newrelic.com *.google-analytics.com *.googletagmanager.com *.googleadservices.com *.googlesyndication.com *.doubleclick.net *.survicate.com in.wzrkt.com *.nr-data.net *.newrelic.com *.speedcurve.com *.ampproject.org *.netcore.co.in *.netcoresmartech.com *.criteo.net *.criteo.com https://secure.livechatinc.com". 'strict-dynamic' is present, so host-based whitelisting is disabled. Note that 'script-src-elem' was not explicitly set, so 'script-src' is used as a fallback.

", source: https://www.practo.com/delhi/doctor/dr-rajeev-puri-ear-nose-throat-ent-specialist?specialization=Ear-Nose-Throat%20(ENT)%20Specialist&practice_id=912154 (0)
[1206/125829.645:INFO:CONSOLE(33)] "[Report Only] Refused to execute inline script because it violates the following Content Security Policy directive: "script-src 'self' 'unsafe-inline' 'unsafe-eval' 'strict-dynamic' 'nonce-eNRfqc27QHPklLLhavu92zuUGDeEoSZL' *.practo.com *.practostatic.com *.onesignal.com *.mxpnl.com *.mixpanel.com *.facebook.com *.facebook.net *.twitter.com *.gstatic.com *.googleapis.com *.google.com *.googlesyndication.com *.newrelic.com *.google-analytics.com *.googletagmanager.com *.googleadservices.com *.googlesyndication.com *.doubleclick.net *.survicate.com in.wzrkt.com *.nr-data.net *.newrelic.com *.speedcurve.com *.ampproject.org *.netcore.co.in *.netcoresmartech.com *.criteo.net *.criteo.com https://secure.livechatinc.com". Note that 'unsafe-inline' is ignored if either a hash or nonce value is present in the source list.
    ", source: https://www.practo.com/delhi/doctor/dr-rajeev-puri-ear-nose-throat-ent-specialist?specialization=Ear-Nose-Throat%20(ENT)%20Specialist&practice_id=912154 (33)
[1206/125829.829:INFO:CONSOLE(0)] "[Report Only] Refused to frame 'https://9535906.fls.doubleclick.net/' because it violates the following Content Security Policy directive: "frame-src 'self' https://survicate.com *.practo.com *.criteo.net *.criteo.com https://www.facebook.com https://bid.g.doubleclick.net https://secure.livechatinc.com".
", source: https://www.googletagmanager.com/ (0)
[1206/125830.508:INFO:CONSOLE(0)] "[Report Only] Refused to frame 'https://9535906.fls.doubleclick.net/' because it violates the following Content Security Policy directive: "frame-src 'self' https://survicate.com *.practo.com *.criteo.net *.criteo.com https://www.facebook.com https://bid.g.doubleclick.net https://secure.livechatinc.com".
", source: https://www.googletagmanager.com/ (0)
undetected Selenium
Tauqeer Sajid

2 Answers

This error message...

[1206/102937.475:INFO:CONSOLE(0)] "[Report Only] Refused to load the script 
'https://www.googletagmanager.com/gtag/js?id=AW-942004674' because it violates the following Content Security Policy directive: "script-src 'self' 'unsafe-inline' 'unsafe-eval' 'strict-dynamic' 'nonce-3RJz12sDPuoV27qS7dcBXLRZawmPobLo' *.practo.com *.practostatic.com *.onesignal.com *.mxpnl.com *.mixpanel.com *.facebook.com *.facebook.net *.twitter.com *.gstatic.com *.googleapis.com *.google.com *.googlesyndication.com *.newrelic.com *.google-analytics.com *.googletagmanager.com *.googleadservices.com *.googlesyndication.com *.doubleclick.net *.survicate.com in.wzrkt.com *.nr-data.net *.newrelic.com *.speedcurve.com *.ampproject.org *.netcore.co.in *.netcoresmartech.com *.criteo.net *.criteo.com https://secure.livechatinc.com". 'strict-dynamic' is present, so host-based whitelisting is disabled. Note that 'script-src-elem' was not explicitly set, so 'script-src' is used as a fallback.
.
[1206/125830.508:INFO:CONSOLE(0)] "[Report Only] Refused to frame 'https://9535906.fls.doubleclick.net/' because it violates the following Content Security Policy directive: "frame-src 'self' https://survicate.com *.practo.com *.criteo.net *.criteo.com https://www.facebook.com https://bid.g.doubleclick.net https://secure.livechatinc.com".
", source: https://www.googletagmanager.com/ (0)

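...is a set of INFO-level browser console messages that Chrome echoes while running headless. They are tagged [Report Only], so they record resources that would violate the site's Content Security Policy in report-only mode; they do not imply that ChromeDriver was unable to initiate/spawn a new Browsing Context, i.e. a Chrome Browser session.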


Content Security Policy (CSP)

To mitigate cross-site scripting issues, Chrome's extension system implements Content Security Policy (CSP). CSP introduces strict policies that make extensions more secure by default, and it lets you create and enforce rules governing the types of content that your extensions and applications can load and execute. It works as a block/allowlisting mechanism for resources loaded or executed by your extensions. Defining a reasonable policy makes you consider which resources your extension requires and ask the browser to ensure that those are the only resources it can access. These policies apply on top of the host permissions your extension requests, acting as an additional layer of protection.

Such policies are defined via an HTTP header or a meta element. Within Chrome's extension system, the extension's policy is defined in the extension's manifest.json file as follows:

{
  "content_security_policy": "[POLICY STRING GOES HERE]"
}
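
To make the allowlisting idea concrete, here is a minimal sketch of how a policy string breaks down into directives and their allowed sources. The helper name parse_csp is hypothetical and not part of Chrome or Selenium; the sample policy is a shortened excerpt of the frame-src directive from the log above.

```python
def parse_csp(policy: str) -> dict:
    """Split a CSP policy string into {directive name: [source expressions]}."""
    directives = {}
    for part in policy.split(";"):
        tokens = part.split()
        if not tokens:
            continue  # skip empty segments such as a trailing ';'
        name = tokens[0].lower()  # directive names are case-insensitive
        # A repeated directive is ignored, so only the first occurrence is kept.
        directives.setdefault(name, tokens[1:])
    return directives


if __name__ == "__main__":
    sample = "frame-src 'self' https://survicate.com *.practo.com"
    for directive, sources in parse_csp(sample).items():
        print(directive, sources)
```

Each key is a directive such as script-src or frame-src, and each value is the list of sources the browser would accept for that kind of resource.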

Relaxing the Content Security Policy

Up to Chrome 45, there was no mechanism for relaxing the restriction against executing inline JavaScript; in particular, setting a script policy that includes 'unsafe-inline' had no effect. From Chrome 46 onwards, however, inline scripts can be allowed by specifying the base64-encoded hash of the source code in the policy. This hash must be prefixed by the hash algorithm used (sha256, sha384 or sha512). The policy can be relaxed by adding http://* to style-src and/or script-src as follows:

script-src 'self' http://xxxx 'unsafe-inline' 'unsafe-eval'; 

and/or

style-src 'self' http://xxxx 'unsafe-inline' 'unsafe-eval';
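
The hash-based allowance described above can be sketched with the standard library alone. The function name csp_hash_source is hypothetical, and the sketch assumes the hash is taken over the exact UTF-8 bytes of the inline script body, without the surrounding script tags.

```python
import base64
import hashlib


def csp_hash_source(inline_script: str, algorithm: str = "sha256") -> str:
    """Return a CSP source expression of the form 'sha256-<base64 digest>'."""
    if algorithm not in ("sha256", "sha384", "sha512"):
        raise ValueError("unsupported CSP hash algorithm: " + algorithm)
    # Hash the script text exactly as written; any whitespace change alters the digest.
    digest = hashlib.new(algorithm, inline_script.encode("utf-8")).digest()
    encoded = base64.b64encode(digest).decode("ascii")
    return "'{}-{}'".format(algorithm, encoded)


if __name__ == "__main__":
    source = csp_hash_source("alert('Hello, world.');")
    print("script-src 'self' " + source)
```

The resulting expression goes into the script-src directive next to the other sources, so only inline scripts whose content matches the hash are allowed.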

This use case

However, I was able to access the webpage https://www.practo.com/delhi/doctor/dr-rajeev-puri-ear-nose-throat-ent-specialist?specialization=Ear-Nose-Throat%20(ENT)%20Specialist&practice_id=912154 easily as follows:

  • Code Block:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    options = webdriver.ChromeOptions() 
    options.add_argument('window-size=1200,600')
    options.add_argument('--headless')
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
    driver.get("https://www.practo.com/delhi/doctor/dr-rajeev-puri-ear-nose-throat-ent-specialist?specialization=Ear-Nose-Throat%20(ENT)%20Specialist&practice_id=912154")
    print(driver.page_source)
    driver.quit()
    
  • Console Output:

    <html><head><title>Dr. Rajeev Puri - ENT/ Otorhinolaryngologist - Book Appointment Online, View Fees, Feedbacks | Practo</title><meta name="description" content="Dr. Rajeev Puri is an ENT/ Otorhinolaryngologist in DLF Phase IV. Book appointments Online, View Fees, User Feedbacks for Dr. Rajeev Puri | Practo"><meta charset="utf-8"><meta http-equiv="x-ua-compatible" content="ie=edge"><script src="https://js-agent.newrelic.com/nr-spa-1026.min.js"></script><script src="//survey.survicate.com/workspaces/wfhrNWYKtlLEWMqcaXcweuzHeMRiSljw/web_surveys.js" async=""></script><script src="//api.survicate.com/assets/survicate.js" async=""></script><script src="//survey.survicate.com/workspaces/wfhrNWYKtlLEWMqcaXcweuzHeMRiSljw/web_surveys.js" async=""></script><script src="//api.survicate.com/assets/survicate.js" async=""></script><script src="https://surveys-static.survicate.com/widget_core-3.0.4.js" async=""></script><script src="//survey.survicate.com/workspaces/wfhrNWYKtlLEWMqcaXcweuzHeMRiSljw/web_surveys.js" async=""></script><script src="//survey.survicate.com/workspaces/wfhrNWYKtlLEWMqcaXcweuzHeMRiSljw/web_surveys.js" async=""></script><script src="//survey.survicate.com/workspaces/wfhrNWYKtlLEWMqcaXcweuzHeMRiSljw/web_surveys.js" async=""></script><script src="//api.survicate.com/assets/survicate.js" async=""></script><script src="//api.survicate.com/assets/survicate.js" async=""></script><script src="//api.survicate.com/assets/survicate.js" async=""></script><script type="text/javascript" async="" src="https://www.googleadservices.com/pagead/conversion_async.js" nonce=""></script><script type="text/javascript" async="" src="https://www.google-analytics.com/plugins/ua/ec.js" nonce=""></script><script type="text/javascript" async="" src="https://www.practostatic.com/pel/clevertap/a.js"></script><script async="" src="//sweep.practo.com/sp.js"></script><script type="text/javascript" src="https://www.practostatic.com/pel/pel-1.6.1.js"></script><script async="" 
src="https://connect.facebook.net/en_US/fbevents.js"></script><script async="" src="//www.google-analytics.com/analytics.js"></script><script async="" src="https://www.googletagmanager.com/gtm.js?id=GTM-PSMVGL5"></script><script nonce="" type="text/javascript">(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
                  new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
                  j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
                  'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
                  })(window,document,'script','dataLayer',"GTM-PSMVGL5");</script>
    

Reference

You can find a relevant discussion in "Call to eval() blocked by CSP with Selenium IDE".

undetected Selenium

There's actually a Chrome DevTools Protocol command for this, but it's marked experimental:

driver.execute_cdp_cmd("Page.setBypassCSP", {"enabled": True})
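
For context, here is a rough sketch of where that call could sit in a script. The helper names enable_csp_bypass and fetch_page_source are hypothetical, and the sketch assumes that the bypass only affects pages loaded after the command is sent, so it is issued before driver.get(). It also assumes a chromedriver that Selenium can locate on its own.

```python
def enable_csp_bypass(driver) -> None:
    """Ask Chrome, via the DevTools Protocol, to bypass CSP on later page loads."""
    # Page.setBypassCSP is marked experimental, so its behaviour may change.
    driver.execute_cdp_cmd("Page.setBypassCSP", {"enabled": True})


def fetch_page_source(url: str) -> str:
    """Open url in headless Chrome with CSP bypass requested and return the HTML."""
    # Imported here so enable_csp_bypass can be reused with any Chrome driver object.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        enable_csp_bypass(driver)  # assumption: must run before navigation
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()
```

Calling fetch_page_source with one of the Practo URLs from the question would then return the rendered HTML, provided Chrome and a matching driver are installed.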
pguardiario