
Trying to learn something about web scraping, I thought a data-driven page with lots of data to gather, like clutch.co, would be a good goal.

I am taking my first steps in scraping and am running a tiny scraper like so:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd

options = Options()
options.add_argument("--headless")
options.add_argument("--no-sandbox")

driver = webdriver.Chrome(options=options)
driver.get("https://clutch.co/it-services/msp")
page_source = driver.page_source
driver.quit()

soup = BeautifulSoup(page_source, "html.parser")


# Extract the data using some BeautifulSoup selectors
# For example, let's extract the names and locations of the companies

company_names = [name.text for name in soup.select(".company-name")]
company_locations = [location.text for location in soup.select(".locality")]

# Store the data in a Pandas DataFrame

data = {
    "Company Name": company_names,
    "Location": company_locations
}

df = pd.DataFrame(data)

# Save the DataFrame to a CSV file

df.to_csv("clutch_data.csv", index=False)

But at the moment this runs with an empty result (note that I am working on Google Colab).
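To see whether any usable HTML comes back at all, it may help to inspect page_source before parsing. Here is a minimal check, reusing page_source and soup from the snippet above (the "Just a moment" marker is only an assumption about what a Cloudflare interstitial contains):

# Quick diagnostic on the raw HTML the driver returned
print(len(page_source))  # a very small document usually means nothing useful came back
print(soup.title.string if soup.title else "no <title> found")

# Assumed marker: Cloudflare challenge pages often say "Just a moment..."
if "Just a moment" in page_source:
    print("Looks like a Cloudflare challenge page rather than the directory listing")

print(len(soup.select(".company-name")), "elements matched .company-name")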

    Did you check whether you get useable data in 'page_source' in the first place? – srn Jun 22 '23 at 20:27
    Look, this is *not* going to work on `colab`. `Cloudflare` detects the headless browser and prevents you from getting *any* data, because the `HTML` response you get is a [Cloudflare challenge](https://developers.cloudflare.com/fundamentals/get-started/concepts/cloudflare-challenges/). Please review the answers in your [previous and bountied question](https://stackoverflow.com/questions/76401453/gathering-data-from-clutch-io-some-issues-with-bs4-while-working-on-colab/). I repeat, you are more likely to get the desired results by *not* running *any* of the suggested solutions on `colab`. – baduker Jun 22 '23 at 20:31
    Hello dear baduker, dear srn, many thanks for the reply. This is a very special case, and I believe the Cloudflare thing makes it even more special, so I am preparing my notebook to get it ready to run from here. Many, many thanks for all the hints and help. – malaga Jun 22 '23 at 22:20

1 Answer


Here's a simple, neat, and clean implementation.

import pandas as pd
from bs4 import BeautifulSoup
from selenium.webdriver import Chrome

driver = Chrome()
driver.get("https://clutch.co/it-services/msp")

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# Each <li> in the directory list is one company card
company_list = soup.select('ul.directory-list>li')
print(f"Total companies on the page: {len(company_list)}")

data = []
for company in company_list:
    data.append({
        "company_name": company.select_one('h3.company_info').text.strip(),
        "company_location": company.select_one('span.locality').text.strip()
    })

df = pd.DataFrame(data)
df.to_csv("clutch_data.csv", index=False)
print(df.head(10))

Output:

Total companies on the page: 50

                     company_name        company_location
0                          EMPIST             Chicago, IL
1                       SugarShot       Redondo Beach, CA
2                   Veraqor, Inc.           Princeton, NJ
3              Vertical Computers               Chino, CA
4  Andromeda Technology Solutions            Lockport, IL
5          BetterWorld Technology              Reston, VA
6              Symphony Solutions  Amsterdam, Netherlands
7                   Andersen Inc.            New York, NY
8               Blackthorn Vision          Львів, Ukraine
9            PCA Technology Group             Buffalo, NY
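If some of the <li> entries turn out not to be regular company cards (sponsored slots, for example), select_one() returns None and calling .text raises an AttributeError. A hedged variant of the extraction loop that simply skips such entries (same selectors as above, assumed to match the current clutch.co markup):

data = []
for company in company_list:
    name = company.select_one('h3.company_info')
    location = company.select_one('span.locality')
    # Skip list items where either selector does not match
    if name is None or location is None:
        continue
    data.append({
        "company_name": name.text.strip(),
        "company_location": location.text.strip()
    })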