
As the title says, I'm having trouble scraping a site, specifically bloomberg.com. I need to open a link like this:

    from selenium import webdriver
    driver = webdriver.Chrome(path_to_driver)
    driver.get("https://www.bloomberg.com/research/stocks/private/snapshot.asp?privcapId=4253471")


But I immediately get a warning, and a captcha pops up on the second link I open. I'm not flooding the website with requests; all I'm doing is calling driver.get() every 10 seconds or so.

What I have tried so far: from a similar question, I learned that you should open chromedriver.exe in a hex editor and replace "$cdc" with something like "xyzw", but doing that changed nothing. (I get a different IP address when I switch my router off and on, so I'm definitely not IP-blocked.)
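For reference, the hex-editor step described above amounts to a same-length byte substitution. Here is a minimal sketch of that step in Python, assuming the "$cdc" and "xyzw" strings from this question; the function names `replace_marker` and `patch_file` are made up for illustration, and, as noted, the substitution did not remove the captcha in my case.

```python
from pathlib import Path


def replace_marker(data: bytes, old: bytes, new: bytes) -> bytes:
    """Return data with every occurrence of old replaced by new.

    The replacement must be the same length as the original so that
    nothing else inside the binary shifts position.
    """
    if len(old) != len(new):
        raise ValueError("replacement must be the same length as the original")
    return data.replace(old, new)


def patch_file(src: Path, dst: Path, old: bytes = b"$cdc", new: bytes = b"xyzw") -> int:
    """Write a patched copy of src to dst and return how many markers were replaced.

    Hypothetical helper: it writes to a separate file so the original
    driver binary is left untouched.
    """
    data = src.read_bytes()
    count = data.count(old)
    dst.write_bytes(replace_marker(data, old, new))
    return count
```

Writing to a separate output file keeps the original driver available if the patched copy misbehaves.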

Any ideas what can be done here? I have never encountered anything like this before: getting blocked on the first link.

DoctorEvil
  • That's not where one should use Selenium. For parsing, you can go via the backend, hit the respective APIs, and then act on the data you receive in the response. – Abhishek_Mishra Aug 17 '18 at 10:48
  • From a link like the one above I need to scrape the address and phone number; I doubt that is possible with an API. – DoctorEvil Aug 17 '18 at 10:58
  • https://www.bloomberg.com/research/stocks/private/snapshot.asp?privcapId=4253471 It's a GET call with the query parameter privcapId, which gives me an HTML response, and in there I can see the contact number and address details, which can easily be parsed from the response. @DoctorEvil – Abhishek_Mishra Aug 17 '18 at 11:09
  • How? I tried r = requests.get("https://www.bloomberg.com/research/stocks/private/snapshot.asp?privcapId=4253471"), then, using lxml, tree = html.fromstring(r.content), and finally tree.xpath("//div[@itemprop='address']") gives no results. – DoctorEvil Aug 17 '18 at 11:43
  • Did you actually try calling the request? I always get "Terms of Service Violation" in the response. – DoctorEvil Aug 17 '18 at 11:56
  • You have to provide proper cookies and other data as well. @DoctorEvil – Abhishek_Mishra Aug 17 '18 at 13:46
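The parsing step discussed in these comments can be sketched offline. This is only an illustration, not a confirmed working approach: it uses the standard-library `html.parser` in place of lxml, looks for the `itemprop="address"` attribute mentioned above, and treats the "Terms of Service Violation" text as the sign of a block page. The sample HTML is invented, since the real page layout is not shown here, and the names `extract_itemprop` and `is_block_page` are hypothetical.

```python
from html.parser import HTMLParser

# Text reported in the comments above as the blocked response.
BLOCK_MARKER = "Terms of Service Violation"
# Tags that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def is_block_page(html_text: str) -> bool:
    """Return True if the response looks like the block page."""
    return BLOCK_MARKER in html_text


class ItempropText(HTMLParser):
    """Collect the text inside the first element with a given itemprop."""

    def __init__(self, itemprop: str):
        super().__init__()
        self.itemprop = itemprop
        self.depth = 0      # > 0 while inside the matching element
        self.done = False   # set once the matching element has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("itemprop") == self.itemprop:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth and data.strip():
            self.parts.append(data.strip())


def extract_itemprop(html_text: str, itemprop: str):
    """Return the element's text, or None if blocked or not found."""
    if is_block_page(html_text):
        return None
    parser = ItempropText(itemprop)
    parser.feed(html_text)
    return " ".join(parser.parts) or None


# Invented sample markup; the real page may be structured differently.
sample = '<div itemprop="address"><p>1 Example Street<br>Springfield</p></div><p>other</p>'
print(extract_itemprop(sample, "address"))
```

Checking for the block marker first makes an empty XPath or parser result easier to interpret: no match on a block page means the request was rejected, not that the selector is wrong.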

1 Answer


More detail about exactly what you want to scrape from the website would have helped us debug the issue.

However, to scrape the two Key Developments, you can use the following solution:

  • Code Block:

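      from selenium import webdriver
      from selenium.webdriver.chrome.service import Service
      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support import expected_conditions as EC

      options = webdriver.ChromeOptions()
      options.add_argument('start-maximized')
      options.add_argument('disable-infobars')
      options.add_argument('--disable-extensions')
      # Selenium 4 style: the driver path goes through Service and the options through `options=`.
      # The older `chrome_options=` and `executable_path=` keywords are deprecated and
      # no longer accepted by recent Selenium 4 releases.
      driver = webdriver.Chrome(service=Service(r'C:\WebDrivers\chromedriver.exe'), options=options)
      driver.get('https://www.bloomberg.com/research/stocks/private/snapshot.asp?privcapId=4253471')
      # Wait up to 10 seconds for the Key Developments paragraphs to become visible, then print each one.
      for item in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.newsItem p"))):
          print(item.get_attribute("innerHTML"))
      driver.quit()
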
    
  • Console Output:

      CARDONE Industries has named Michael Cardone, III as the Executive Chairman of its Board of Directors. The company is also pleased to announce the addition of Dena Moore and Bill Strahan as new Board members. Michael Cardone, III is an owner of CARDONE Industries and serves on the company's Board of Directors. He has also served in Executive leadership roles with CARDONE, including President, since 1998. As Executive Chairman, he will focus on CARDONE's long-term growth strategies, including acquisition activity and the company's footprint and real estate holdings. He will also be responsible for managing the Board of Directors and its processes. Dena Moore spent 20 years as a senior merger and acquisition investment banker and as Chief Operating Officer for Harris Williams & Co., now a subsidiary of PNC Financial Services Group. Today, as the founder of DFM Advisory, LLC, she works primarily with entrepreneurs to provide strategic and operational consulting services. Bill Strahan is Executive Vice President of Human Resources for Comcast Cable.
      CARDONE Industries, Inc. announced plans to build a new, state-of-the-art distribution center in Harlingen, TX, near the company’s current core processing facilities at 5810 Harrison Avenue. Construction of the new facility is expected to begin in January 2018, and to be finished by December 2018. The new distribution center is intended to support growing production at CARDONE’s manufacturing facilities, and the building will be constructed with the capacity for future expansion, as needed. CARDONE expects the new distribution center to create hundreds of new jobs in the Harlingen area. Along with its facilities in Philadelphia, Texas, Los Angeles, Canada and Mexico, CARDONE added operations in Vancouver, Phoenix, Seattle, Toronto, Spain and China through its recent acquisition of ADP Distributors and Rotomaster on November 20, 2017.
    
undetected Selenium