I am trying to scrape the Century Office Products, Inc. page on this website, but I am unable to extract this text:

Century Office Products, Inc. industry is listed as Ret Misc Merchandise

because it sits in a bare #text node rather than inside a tag of its own. This is the code I have tried:

driver.get('https://www.corporationwiki.com/New-Jersey/Middlesex/century-office-products-inc/53844156.aspx')
text = [k.text for k in driver.find_elements_by_xpath("//div[@class='card']//div[@class='card-body']//h2//following::p[2]")]
Martin Evans
Prakhar T
  • So you want to parse only the portion of text you quoted in your description, and nothing else, right? – SIM Jun 24 '19 at 13:20
  • @SIM Yes, I am able to scrape the other things. – Prakhar T Jun 24 '19 at 13:30

2 Answers

Using XPath:

import requests
from lxml.html import fromstring

link = "https://www.corporationwiki.com/New-Jersey/Middlesex/century-office-products-inc/53844156.aspx"

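# Send a browser-like User-Agent header with the request.
r = requests.get(link, headers={'User-Agent':'Mozilla/5.0'})
tree = fromstring(r.text)
# Take the first text node that follows the div inside the card body.
elem = tree.xpath("//*[@class='card-body']/div/following::text()")[0].strip()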
print(elem)

Using a CSS selector:

import requests
from bs4 import BeautifulSoup

link = "https://www.corporationwiki.com/New-Jersey/Middlesex/century-office-products-inc/53844156.aspx"

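# Send a browser-like User-Agent header with the request.
r = requests.get(link, headers={'User-Agent':'Mozilla/5.0'})
soup = BeautifulSoup(r.text, 'lxml')
# The wanted sentence is a bare text node, so read the div's next sibling.
elem = soup.select_one("[class='card-body'] > div").next_sibling.strip()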
print(elem)

They both produce the same output:

Century Office Products, Inc. industry is listed as Ret Misc Merchandise.
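
Both snippets rely on the same idea: the sentence is not wrapped in a tag of its own, so it has to be read as the text that follows a sibling element. The sketch below illustrates this offline with only the standard library. The markup in SNIPPET and the helper name text_after_first_child are hypothetical stand-ins made up for illustration, not the real page's HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page's markup (not the real HTML).
SNIPPET = """<div class="card-body">
  <div><h2>Overview</h2></div>
  Century Office Products, Inc. industry is listed as Ret Misc Merchandise.
  <p>Other details</p>
</div>"""


def text_after_first_child(markup):
    """Return the bare text that follows the root's first child element."""
    root = ET.fromstring(markup)
    first_child = root[0]
    # ElementTree (like lxml) stores the text that follows an element's
    # closing tag in that element's .tail attribute; it is None when absent.
    return (first_child.tail or "").strip()


print(text_after_first_child(SNIPPET))
```

With lxml, the same .tail attribute is available on the element matched by the XPath expression, which is one alternative to the following::text() axis used above.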
SIM

To extract the text Century Office Products, Inc. using Selenium, wait for the element with WebDriverWait and visibility_of_element_located(). You can use the following locator strategy:

  • XPath:

    • Code Block:

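      from selenium import webdriver
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.support import expected_conditions as EC

      chrome_options = webdriver.ChromeOptions()
      chrome_options.add_argument("start-maximized")
      chrome_options.add_argument('disable-infobars')
      chrome_options.add_argument('--allow-running-insecure-content')
      # Selenium 3 style arguments; Selenium 4 uses options= and a Service object instead.
      driver = webdriver.Chrome(chrome_options=chrome_options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
      driver.get("https://www.corporationwiki.com/New-Jersey/Middlesex/century-office-products-inc/53844156.aspx")
      # Wait for the heading to be visible, then read the text of its last child node.
      print(driver.execute_script('return arguments[0].lastChild.textContent;', WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//h1[@itemprop='legalName']")))).strip())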
      
    • Console Output:

      Century Office Products, Inc.
      
undetected Selenium