
Update: what about Selenium support in Colab? I have checked this; see below!

Update 2: thanks to baduker and his reply with the Colab workaround and results, I have tried to add some more code in order to parse some of the results.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import pandas as pd

options = Options()
options.add_argument("--headless")
options.add_argument("--no-sandbox")

driver = webdriver.Chrome(options=options)
driver.get("https://clutch.co/it-services/msp")
page_source = driver.page_source
driver.quit()

soup = BeautifulSoup(page_source, "html.parser")


# Extract the data using some BeautifulSoup selectors.
# For example, extract the names and locations of the companies
# (these selectors are illustrative and may not match the live page).

company_names = [name.text for name in soup.select(".company-name")]
company_locations = [location.text for location in soup.select(".locality")]

# Store the data in a Pandas DataFrame

data = {
    "Company Name": company_names,
    "Location": company_locations
}

df = pd.DataFrame(data)

# Save the DataFrame to a CSV file

df.to_csv("clutch_data.csv", index=False)

But this leads to no results.
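Before parsing, it may help to confirm that the HTML Selenium returned is the real directory page and not a Cloudflare interstitial; in the latter case every selector will come back empty. A minimal, stdlib-only sketch (the helper name and the marker strings are my own heuristic, not anything official):

```python
def looks_like_cloudflare_challenge(html: str) -> bool:
    """Heuristic check: Cloudflare interstitials title the page
    'Just a moment...' and ask the visitor for JavaScript and cookies."""
    markers = (
        "Just a moment...",
        "Enable JavaScript and cookies",
        "challenges.cloudflare.com",
    )
    return any(marker in html for marker in markers)

# The page Colab received is a challenge page, so parsing it yields nothing:
print(looks_like_cloudflare_challenge("<title>Just a moment...</title>"))  # True
```

Running this on `page_source` right after `driver.get(...)` would tell you immediately whether the scrape is worth parsing at all.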

I will try to dig deeper into that, probably in a new thread. Thank you, dear baduker.

End of the last update (the second update), written on June 22nd in Malaga.

Good day, dear experts. At the moment I am trying to figure out a simple method to obtain data from clutch.co.

Note: I work with Google Colab, and sometimes I think that some approaches are not supported on my Colab account, partly due to Cloudflare issues.

But see this one:

import requests
from bs4 import BeautifulSoup
url = 'https://clutch.co/it-services/msp'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

links = []
for li in soup.find_all('li', class_='website-link website-link-a'):
    links.append(li.a.get('href'))

print(links)

This also does not work. Do you have any idea how to solve the issue?

It returns an empty result.
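For what it's worth, the parsing loop itself is fine; run against HTML that actually contains the list items, it extracts the hrefs, which points the finger at the fetched page rather than the selector. A sketch with inline sample markup (the example-*.com URLs are made up):

```python
from bs4 import BeautifulSoup

sample = """
<ul>
  <li class="website-link website-link-a"><a href="https://example-a.com">A</a></li>
  <li class="website-link website-link-a"><a href="https://example-b.com">B</a></li>
</ul>
"""

soup = BeautifulSoup(sample, "html.parser")
# Same multi-class exact-string match as in the snippet above.
links = [li.a.get("href")
         for li in soup.find_all("li", class_="website-link website-link-a")]
print(links)  # ['https://example-a.com', 'https://example-b.com']
```

So an empty list here means the downloaded HTML simply doesn't contain those `li` elements.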

Update: hello dear user510170, many thanks for the answer and the Selenium solution. I tried it out in Google Colab and got the following results:

--------------------------------------------------------------------------
WebDriverException                        Traceback (most recent call last)
<ipython-input-2-4f37092106f4> in <cell line: 4>()
      2 from selenium import webdriver
      3 
----> 4 driver = webdriver.Chrome()
      5 
      6 url = 'https://clutch.co/it-services/msp'

5 frames
/usr/local/lib/python3.10/dist-packages/selenium/webdriver/remote/errorhandler.py in check_response(self, response)
    243                 alert_text = value["alert"].get("text")
    244             raise exception_class(message, screen, stacktrace, alert_text)  # type: ignore[call-arg]  # mypy is not smart enough here
--> 245         raise exception_class(message, screen, stacktrace)

WebDriverException: Message: unknown error: cannot find Chrome binary
Stacktrace:
#0 0x56199267a4e3 <unknown>
#1 0x5619923a9c76 <unknown>
#2 0x5619923d0757 <unknown>
#3 0x5619923cf029 <unknown>
#4 0x56199240dccc <unknown>
#5 0x56199240d47f <unknown>
#6 0x561992404de3 <unknown>
#7 0x5619923da2dd <unknown>
#8 0x5619923db34e <unknown>
#9 0x56199263a3e4 <unknown>
#10 0x56199263e3d7 <unknown>
#11 0x561992648b20 <unknown>
#12 0x56199263f023 <unknown>
#13 0x56199260d1aa <unknown>
#14 0x5619926636b8 <unknown>
#15 0x561992663847 <unknown>
#16 0x561992673243 <unknown>
#17 0x7efc5583e609 start_thread

To me it seems to be related to line 4 of the traceback:

  ----> 4 driver = webdriver.Chrome()

Is it this line that needs a minor correction?
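The "cannot find Chrome binary" message means chromedriver started but could not locate a Chrome/Chromium executable on the Colab VM. A hedged sketch of the usual fix, assuming a Chromium binary has already been installed via apt (the /usr/bin/chromium-browser path is an assumption and varies by image):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
options.add_argument("--no-sandbox")
# Assumed path; point this at wherever Chromium actually lives on the VM.
options.binary_location = "/usr/bin/chromium-browser"

driver = webdriver.Chrome(options=options)
```

Without a browser binary installed at that path, this still fails; the install step comes first.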

Update: thanks to Tarun I got notice of this workaround:

https://medium.com/cubemail88/automatically-download-chromedriver-for-selenium-aaf2e3fd9d81

I did it; in other words, I applied it to Google Colab and tried to run the following:

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

browser = webdriver.Chrome(ChromeDriverManager().install())
browser.get("https://www.reddit.com/")
browser.quit()

Well, finally it should be possible to run this code in Colab:

import requests
from bs4 import BeautifulSoup
url = 'https://clutch.co/it-services/msp'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

links = []
for li in soup.find_all('li', class_='website-link website-link-a'):
    links.append(li.a.get('href'))

print(links)

Update: see below the check in Colab, and the question: is Colab generally Selenium-capable and Selenium-ready?


I look forward to hearing from you.

Thanks to @user510170, who pointed me to another approach: How can we use Selenium Webdriver in colab.research.google.com?

Recently Google Colab was upgraded, and since Ubuntu 20.04+ no longer distributes chromium-browser outside of a snap package, you can install a compatible version from the Debian buster repository.

Then you can run selenium like this:

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')

wd = webdriver.Chrome('chromedriver', options=chrome_options)
wd.get("https://www.website-url.com")

Cf. this thread: How can we use Selenium Webdriver in colab.research.google.com?

I need to try this out on Colab.

  • See if this helps https://medium.com/cubemail88/automatically-download-chromedriver-for-selenium-aaf2e3fd9d81 – Tarun Lalwani Jun 06 '23 at 03:05
  • hello many many thanks - i did it - i applied it : is it this line that needs a minor correction and change!? update: thanks to tarun i got notice of this workaround here - see above. But i need to test it now! – malaga Jun 06 '23 at 08:59
  • hmm - try to figure it out: WebDriverException: Message: unknown error: cannot find Chrome binary Stacktrace: – malaga Jun 06 '23 at 09:14
  • probably i am doing something wrong here – malaga Jun 06 '23 at 09:15
  • It may be that colab doesn't support selenium, check on that first. – Tarun Lalwani Jun 07 '23 at 02:56
  • hello dear Tarun - many many thanks - i checked this - well see the op. where i documented that we re able to install selenium on colab. - dear Tarun i love to hear from you - it would be a pleasure if we can get this to work - – malaga Jun 07 '23 at 09:08
  • @Tarun Lalwani : can you help me here. I still stuggle with the approach! – malaga Jun 19 '23 at 10:53
  • Why do you have to run this on colab so badly? Setting selenium up on colab is notoriously prone to error and even if you do manage to run it in headless mode, clutch will throw a cloudflare challange at you. – baduker Jun 21 '23 at 21:09

2 Answers

If you do print(response.content), you will see the following: "Enable JavaScript and cookies to continue." Without JavaScript enabled, you don't get access to the full content. Here is a working solution based on Selenium.

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()

url = 'https://clutch.co/it-services/msp'

driver.get(url=url)
soup = BeautifulSoup(driver.page_source,"lxml")

links = []
for li in soup.find_all('li', class_='website-link website-link-a'):
    links.append(li.a.get('href'))

print(links, "\n", "Count links - ", len(links))

Result:

...ch.co&utm_medium=referral&utm_campaign=directory', 'https://www.turrito.com/?utm_source=clutch.co&utm_medium=referral&utm_campaign=directory'] 
 Count links -  50
  • hello many thanks for the reply and for sharing your ideas - well i have postet the results into the threadstart - see above. the results note that the webdriver - chrome - is probably still not correct - or producing some errors - what do y ou say. - look forward to hear from you regards – malaga Jun 05 '23 at 08:15
  • Have you tried the solution from [this post](https://stackoverflow.com/questions/46026987/selenium-gives-selenium-common-exceptions-webdriverexception-message-unknown)? – user510170 Jun 05 '23 at 08:50
  • hello again - well afaik i need to add the path to chrome!? I am trying to get this workaround implemented in my colab-solution - i hope that i can get this to work.... – malaga Jun 05 '23 at 17:37
  • dear user510170 - i would love to hear from you - for me it is a challenge to get this clutch-scraper-thing to work.. – malaga Jun 07 '23 at 09:12
  • @malaga, it is difficult to say exactly why the code does not work on your system. I think the problem is in your build. On my `ubuntu 20.04, python==3.9, selenium==4.9.1` everything works correctly. Perhaps you need to add the path to the chrome binary file, for example, `chrome_driver_binary = "/usr/local/bin/chromedriver"` – user510170 Jun 07 '23 at 13:09
  • @malaga Have you tried this [solution](https://stackoverflow.com/questions/51046454/how-can-we-use-selenium-webdriver-in-colab-research-google-com)? – user510170 Jun 07 '23 at 13:16
  • hello dear user510170 - many thanks not yet - i have gathered the code to install this on colab - see the addition on the original posting - down - should i do the installation like metinoed above!? BTW - i currently set up anaconda on my endeavouros - did the installation - btw duhhh - how to start the anaconda from commandline !? with .... conda...!? – malaga Jun 08 '23 at 10:52
  • dear user510170 - i really look forward to hear again from you – malaga Jun 08 '23 at 10:52
  • can you help me here. I still stuggle with the approach! – malaga Jun 19 '23 at 10:54
  • Hello @malaga! really, I was hoping that you had already solved this problem. The proposed option works on ubuntu, but I'm not sure how to solve the problem with installing and running selenium on colab. You may need to create a separate question on this topic. – user510170 Jun 19 '23 at 12:32
  • helllo dear user510179 many thansk for the quick reply awesome to hear from you - i am currently very very busy - but i try to install Anaconda on my endeavourOS - and will try out this on the Anaconda - do you think that this will work - guess that colab is somewhatr tricky here !? – malaga Jun 19 '23 at 13:01
  • Hello dear @malaga. I think if this solves your problem, and the solution will suit you, you should try to do it. – user510170 Jun 20 '23 at 06:20
  • hello dear @user510170 - many many thanks for the reply - you mean the answert that you gave on june 5th - well i will try it out - sure thing. Many many thanks for all you do - you rock!!! – malaga Jun 20 '23 at 08:36

TL;DR

The big guns of selenium can't shoot the Cloudflare sheriff.

The Colab link contains what's below.


All right, here's a working Selenium setup on Google Colab that proves my point in the comment: even if you get it running, you still must deal with a Cloudflare challenge.

Do the following:

  • Open a new colab Notebook
  • Run the code below:
%%shell
# Ubuntu no longer distributes chromium-browser outside of snap
#
# Proposed solution: https://askubuntu.com/questions/1204571/how-to-install-chromium-without-snap

# Add debian buster
cat > /etc/apt/sources.list.d/debian.list <<'EOF'
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster.gpg] http://deb.debian.org/debian buster main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-buster-updates.gpg] http://deb.debian.org/debian buster-updates main
deb [arch=amd64 signed-by=/usr/share/keyrings/debian-security-buster.gpg] http://deb.debian.org/debian-security buster/updates main
EOF

# Add keys
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys DCC9EFBF77E11517
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 112695A0E562B32A

apt-key export 77E11517 | gpg --dearmour -o /usr/share/keyrings/debian-buster.gpg
apt-key export 22F3D138 | gpg --dearmour -o /usr/share/keyrings/debian-buster-updates.gpg
apt-key export E562B32A | gpg --dearmour -o /usr/share/keyrings/debian-security-buster.gpg

# Prefer debian repo for chromium* packages only
# Note the double-blank lines between entries
cat > /etc/apt/preferences.d/chromium.pref << 'EOF'
Package: *
Pin: release a=eoan
Pin-Priority: 500


Package: *
Pin: origin "deb.debian.org"
Pin-Priority: 300


Package: chromium*
Pin: origin "deb.debian.org"
Pin-Priority: 700
EOF

# Install chromium and chromium-driver
apt-get update
apt-get install chromium chromium-driver

# Install selenium
pip install selenium
  • Then run this code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
options.add_argument("--no-sandbox")


driver = webdriver.Chrome(options=options)

driver.get("https://clutch.co/it-services/msp")
print(driver.page_source)
driver.quit()

You should see this:

<html lang="en-US" class="lang-en"><head>
    <title>Just a moment...</title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=Edge">
    <meta name="robots" content="noindex,nofollow">
    <meta name="viewport" content="width=device-width,initial-scale=1">
    <link href="/cdn-cgi/styles/challenges.css" rel="stylesheet">
    

<script src="/cdn-cgi/challenge-platform/h/b/orchestrate/managed/v1?ray=7daf435aeecd112d"></script><script src="https://challenges.cloudflare.com/turnstile/v0/b/19ad4730/api.js?onload=_cf_chl_turnstile_l&amp;render=explicit" async="" defer="" crossorigin="anonymous"></script></head>
<body class="no-js">
    <div class="main-wrapper" role="main">
    <div class="main-content">
        <h1 class="zone-name-title h1"><img src="/favicon.ico" class="heading-favicon" alt="Icon for clutch.co">clutch.co</h1><h2 id="challenge-running" class="h2">Checking if the site connection is secure</h2><div id="challenge-stage"></div><div id="challenge-spinner" class="spacer loading-spinner" style="display: block; visibility: visible;"><div class="lds-ring"><div></div><div></div><div></div><div></div></div></div><div id="challenge-body-text" class="core-msg spacer">clutch.co needs to review the security of your connection before proceeding.</div><div id="challenge-explainer-expandable" class="hidden expandable body-text spacer" style="display: none;"><div class="expandable-title" id="challenge-explainer-summary"><button class="expandable-summary-btn" id="challenge-explainer-btn" type="button">Why am I seeing this page?<span class="caret-icon-wrapper"> <div class="caret-icon"></div> </span> </button> </div> <div class="expandable-details" id="challenge-explainer-details">Requests from malicious bots can pose as legitimate traffic. 
Occasionally, you may see this page while the site ensures that the connection is secure.</div></div><div id="challenge-success" style="display: none;"><div class="h2"><span class="icon-wrapper"><img class="heading-icon" alt="Success icon" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAADQAAAA0CAMAAADypuvZAAAANlBMVEUAAAAxMTEwMDAxMTExMTEwMDAwMDAwMDAxMTExMTExMTEwMDAwMDAxMTExMTEwMDAwMDAxMTHB9N+uAAAAEXRSTlMA3zDvfyBAEJC/n3BQz69gX7VMkcMAAAGySURBVEjHnZZbFoMgDEQJiDzVuv/NtgbtFGuQ4/zUKpeMIQbUhXSKE5l1XSn4pFWHRm/WShT1HRLWC01LGxFEVkCc30eYkLJ1Sjk9pvkw690VY6k8DWP9OM9yMG0Koi+mi8XA36NXmW0UXra4eJ3iwHfrfXVlgL0NqqGBHdqfeQhMmyJ48WDuKP81h3+SMPeRKkJcSXiLUK4XTHCjESOnz1VUXQoc6lgi2x4cI5aTQ201Mt8wHysI5fc05M5c81uZEtHcMKhxZ7iYEty1GfhLvGKpm+EYkdGxm1F5axmcB93DoORIbXfdN7f+hlFuyxtDP+sxtBnF43cIYwaZAWRgzxIoiXEMESoPlMhwLRDXeK772CAzXEdBRV7cmnoVBp0OSlyGidEzJTFq5hhcsA5388oSGM6b5p+qjpZrBlMS9xj4AwXmz108ukU1IomM3ceiW0CDwHCqp1NjAqXlFrbga+xuloQJ+tuyfbIBPNpqnmxqT7dPaOnZqBfhSBCteJAxWj58zLk2xgg+SPGYM6dRO6WczSnIxxwEExRaO+UyCUhbOp7CGQ+kxSUfNtLQFC+Po29vvy7jj4y0yAAAAABJRU5ErkJggg=="></span>Connection is secure</div><div class="core-msg spacer">Proceeding...</div></div><noscript>
            <div id="challenge-error-title">
                <div class="h2">
                    <span class="icon-wrapper">
                        <div class="heading-icon warning-icon"></div>
                    </span>
                    <span id="challenge-error-text">
                        Enable JavaScript and cookies to continue
                    </span>
                </div>
            </div>
        </noscript>
        <div id="trk_jschal_js" style="display:none;background-image:url('/cdn-cgi/images/trace/managed/nojs/transparent.gif?ray=7daf435aeecd112d')"></div>
        <form id="challenge-form" action="/it-services/msp?__cf_chl_f_tk=fodu0hgxQaVwZPaQ.DlaPwI.O5svWinEWf94LM_MirI-1687382086-0-gaNycGzNCxA" method="POST" enctype="application/x-www-form-urlencoded">
            <input type="hidden" name="md" value="Y531CK3.GDorU7Iwk2DV6cF23mTV48icLTjuAZV6568-1687382086-0-AY7Fiv3qUhkh_i93AsbTcYh_D3SG2ZegyiWzGIVG8NgRvrQkiLAuCZ_x8rfr_A4Wy5QOyAOrBLs-avkoeJD0_1G3AYtVfv9rIc6umkp5J_y75TurwQH5fCwjSC3biYbFJbdTbW_NeKfRDUQgh230Lb1UMApiygfWXkeMlzznEEKUa3EXALaHU6co68L5nf_vY6c9QyyILeTdhcspjfUkXCUIUB7ff-8QQgCKpUkZa3UH9V9Icbndie4LGMCl_QJsy5jPvIzTt7nAS_Kk1-TrPxZltr8ZyHhjdvhEyVTkyrTi46auFGmixnyt9bK5dKnGv-J59nXp3EMF34gnVnbmTQuMDG9KHaN4bR4Ij6IO94sRnDGIJnXX6aiFLHiqFx9_kh1krAg3qOuuXZ9UghjKoITy2uPx9ng7hZ73p6QILb0aW-f-GL4VBdv-f1mdZyXJYRRrlfpnGoQMy-jxy6zsZshYtI-fzuDAL3A7nU_NVEGoN7SRrS4dFdn2mGhwPwVhhzt37SQ04MMjfs-_r8KNkOVbnNBtfHp_TWwyEbrhM4Lgc-YEYVRrI-J5LVYwIv4K7JAgObKJffhs53zwB0RrFQG3pF2Qy9W8Cxq2HvlKko3clzUXmw6meZfYJPZaYIMbJa39rqF0jltNKoqOcgJa5xQSTSXrNShUO1ClAHsjUGuTA11lM8Dk5rlnS9qXVWhDWI51i-4Q7BPIkb1BqaW6K_0ltyCzXBtN8q1EqrJeno7ryMC1FyCZ2y8Hy0IsHAhNg2DAvhYov34mrEeoOc4iG4ZHZghGAPkf9tNXo5NBTVNbrwDzvwxXaMVWJRHYQ8YB6LiFK7VPWa_ZjEU7GsdWzXpa_Tp4ulnnbUGrdEThXQC3chCij4f3T7m-Pc7LZdTvs-qs2f5g6_kBwiAAro2KelOxhCsf66l5HcpHHy9uhERBx7FgItODQDqG7kR2r80QCo3kOzBqFL3CIsvtg_KYNG8HkxYqDc-YMRWsvBj5Mmt6c8RzCOkDxKC_DJwOj58CeC2o9e-6wCfgcjb0EPR8cTK_S8ht28zPLUCDJ_j119ErBnHJ1zpdJHydT1HEdnK-vaSuyYf69kOSCC7Kij4ZRttSlfiA4k9gau8QoREht_pxMwfxXraBRfYUWVXO_ZSyz561B9C4Fa1L0gW31RXgCRuzCdDg-Cgr9AN8ky06s19D3N4CZLhtGOjRfMbidHVBD9Ppe4jlcUnSx-wdkJkVXZ2S8XO4F4ou7jGhrN9l9mDIDZ98OXaL_CvhHXNBWxE1Gn1_i1_Ndb7VKFP5Y6YuPLTXaN9kS-kF3rZcIBuh_dczTVQKOEWq1QYy9_CBj2sIPSxhcuQCXwTt4K81e6UiIrovBNWiZ4VjKvLdetwmUUgnpfNbssOz5S6GieV7ENqMBdaYlIP9YPdzHdJl4WQ_stCiC_Yc0wew2XI2XvOOil8_7F1yHgCg4mPS98Y9BXNDKiLDGGl3lRs9ydBvCdiY8__KztFLuVyDiWqschUvXUOg07KBtyQDnSxOyZUn873i7Kg4dKoqAyUICRT_nhsNtGUe4wzXYk3eevEG-7Ct4tSBpw6rTrjeNqa9Lsu5b6Pv-eJX0gYpg-1pydKSKLfvQYNp9wjwT-Oh5UH8vw8lo7b3uSc6QMmkaP2jQVDnqIyQDN8cDAYu6Vdr83xiZJG1Qqn80xVe0RMwEzMcjFv7yy6QM3O-uv0tJHC8EnINpXc1uMp1zphYyIgw-xSy68x55DEf38OrsY7xbJUqdMdF_qJQPi3FOh5MYHftgyH1WyDUHrxXiVJYuTMv7DtgaLjGoA0ybDW_PcBOXI5LAXnqYYR92WmHTEghxLHKxpWqZt9t_XS4j4rycqHU261_6zPhkTklv2cUFJOOT5lRT
kY3OySP7-CEp0ZgjPrAOu4g-wt1YUprDjQzYrpmlBUXqKXzeJ795UBKn0HZLDoGQkY5_w-deyzcLV4XZXdGrxnAEOQq5Kx330hD2XgH8Q0be4WinLLZ6R8Tsl3c_5UuxLn0YJlxosFgXXLZehemg9WxGzfrOnb_5reyNr_3KU4nYWl9wFy-wsz6HtyPQ_1LnvBBgxVbrCFy-m9Wm8mt1BcaLwTUA2NSpTY0fbSwkuvx0LKTmG865H5C9qqBAgTGw2R99fv6vqq6ZP_HOzv5Q-c5L2C17lCp4cJwOCkvj7NEWQ1iCoi0X6CWZtVYFC-wXeuI4dh2D6BxGtekFuC77-Rt335ib1wPN7bf6_lA-TPb92U2IUCoq8K9frexCE7QzxaCSdKB-wRkE6g5FERuP-waii1Uquiut4aQ8tJVlwvi0nvuOvuP_Rg4P9xa2HlOxzwajrDBmzfnerhwdyEzOYQTXvwDF-ApPg5rPiMpjo29icy0K9arOF9yY_Wf2EXZD-6hjCcDswfhO3lQWFfnf1ANOFvnp0hcCvr-k93ukAnVbm8uorhSIWr2iy1JeNeGH8kM66IDkdlSLnj9igHNf6C0vDnkBoOolfXQECpmfhS6dai7Np01RQjoKsoGQU1S4rQnjsdYxBdOXqdrfYw_wfsBhV87qxHGUND6uD6m3qwU2vKCyQa_GSIGgzPfqWnhpXozyHUbmBOYJDiKI6u0x3u8mZDWhaaQWYttxUa1gQKnOQy1qM5NI8D881kXI_M2cpvX4rW9coG1k9_qE7yC--4u537ojssm9gSzNnQgeOpn-___N978hwxMqftej1jdhMJePK959TjaeJvMu045n-xtFbFGF81FIhiKtMWskbvRy1wIB3I">
        <span style="display: none;"><span class="text-gray-600" data-translate="error">error code: 1020</span></span></form>
    </div>
</div>
<script>
    (function(){
        window._cf_chl_opt={
            cvId: '2',
            cZone: 'clutch.co',
            cType: 'managed',
            cNounce: '44156',
            cRay: '7daf435aeecd112d',
            cHash: 'f86e351e5e00345',
            cUPMDTk: "\/it-services\/msp?__cf_chl_tk=fodu0hgxQaVwZPaQ.DlaPwI.O5svWinEWf94LM_MirI-1687382086-0-gaNycGzNCxA",
            cFPWv: 'b',
            cTTimeMs: '1000',
            cMTimeMs: '0',
            cTplV: 5,
            cTplB: 'cf',
            cK: "",
            cRq: {
                ru: 'aHR0cHM6Ly9jbHV0Y2guY28vaXQtc2VydmljZXMvbXNw',
                ra: 'TW96aWxsYS81LjAgKFgxMTsgTGludXggeDg2XzY0KSBBcHBsZVdlYktpdC81MzcuMzYgKEtIVE1MLCBsaWtlIEdlY2tvKSBIZWFkbGVzc0Nocm9tZS85MC4wLjQ0MzAuMjEyIFNhZmFyaS81MzcuMzY=',
                rm: 'R0VU',
                d: 'Ac34gEYVhl8DXbnILOq76p8yhzcHr06ria7SjaltDZ17DDHJrhCowkieLnLjzsxr3IgprB+0nJObDfv3tbOFZfQanW8VrnMBqy2JC8EFTBSXy7ra08EgPGOSUetaRr/bENIZ81mt06Vq52ykJX01fCO0wyHdNMat8fNwgF9RDfp7CFMpUtp0E+lofrj9tut74nR1+yniOo1zFt2zmKVpFFUunX1K1oMy8Fp1ubIQgHIBEG8g8h3CRzHD2WMTRtqYfFvCfD5PhcR+uWWgxf6ybQnii3noC7BLSbJZHZ5abVjNKZTvRGyLtkP8uNLoAQTF8A5ir68vmv+c6weSVw845TjogSfOFzHrXQvj5dnpPWEmReEsQfl2p3nJJuswyd/OUIPTMuLfPOM7EYHQKawKqI1+jp15e4QZjAl4LIhAwQoHqqcXPd9NqvBkzxrb7YhWBsvOHzgUMb5gR3exN42NVnFbUimWWdhX7Ei+tXR43I+68kGLFe4kQccvXzfYtl3G7mudbXvhkFMjAJk24bb9ugax1RyJeT1HMXZAZG7vOzGxEpf2Zgly+6twZ+C1JShkmfbHj9Z8EkYIlkxm99wVFg==',
                t: 'MTY4NzM4MjA4Ni44NzAwMDA=',
                cT: Math.floor(Date.now() / 1000),
                m: 'eRBgvpMHb6ottjHZ8LYOdoe7cvhlOKe5j2vP7BjQYIE=',
                i1: 'DrqvOBUgqLvl22W0Yoh8VA==',
                i2: 'Co7rIFnUzVj/9LmqAUCUUw==',
                zh: 'MYPZaDt93/n+i/zoik8Q5B4rNo75M88ZQHevg31AJek=',
                uh: 'U3QjejX60yUnAxm0WjPwFsHXm0FG5VD2yNoc1w8iQek=',
                hh: 'w+icDAWoSjxex064a5CZutpetBiSACwcZG4EmfuqjNI=',
            }
        };
        var trkjs = document.createElement('img');
        trkjs.setAttribute('src', '/cdn-cgi/images/trace/managed/js/transparent.gif?ray=7daf435aeecd112d');
        trkjs.setAttribute('alt', '');
        trkjs.setAttribute('style', 'display: none');
        document.body.appendChild(trkjs);
        var cpo = document.createElement('script');
        cpo.src = '/cdn-cgi/challenge-platform/h/b/orchestrate/managed/v1?ray=7daf435aeecd112d';
        window._cf_chl_opt.cOgUHash = location.hash === '' && location.href.indexOf('#') !== -1 ? '#' : location.hash;
        window._cf_chl_opt.cOgUQuery = location.search === '' && location.href.slice(0, location.href.length - window._cf_chl_opt.cOgUHash.length).indexOf('?') !== -1 ? '?' : location.search;
        if (window.history && window.history.replaceState) {
            var ogU = location.pathname + window._cf_chl_opt.cOgUQuery + window._cf_chl_opt.cOgUHash;
            history.replaceState(null, null, "\/it-services\/msp?__cf_chl_rt_tk=fodu0hgxQaVwZPaQ.DlaPwI.O5svWinEWf94LM_MirI-1687382086-0-gaNycGzNCxA" + window._cf_chl_opt.cOgUHash);
            cpo.onload = function() {
                history.replaceState(null, null, ogU);
            };
        }
        document.getElementsByTagName('head')[0].appendChild(cpo);
    }());
</script><img src="/cdn-cgi/images/trace/managed/js/transparent.gif?ray=7daf435aeecd112d" alt="" style="display: none">




<div class="footer" role="contentinfo"><div class="footer-inner"><div class="clearfix diagnostic-wrapper"><div class="ray-id">Ray ID: <code>7daf435aeecd112d</code></div></div><div class="text-center" id="footer-text">Performance &amp; security by <a rel="noopener noreferrer" href="https://www.cloudflare.com?utm_source=challenge&amp;utm_campaign=m" target="_blank">Cloudflare</a></div></div></div><span id="trk_jschal_js"></span></body></html>

As you can see, running selenium doesn't change much.


So, my question to you is:

Why do you want to stick to colab so badly?

Because running slightly modified code locally:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.EdgeOptions()
options.add_argument("--window-size=1920x1080")
options.add_argument("--disable-gpu")
options.add_argument("--disable-extensions")
browser = webdriver.Edge(options=options)

browser.get("https://clutch.co/it-services/msp")

print("Waiting for download links to appear...")
WebDriverWait(browser, 5).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".infobar__counter"))
)

css_selector = ".directory-list div.provider-info--header .company_info a"
download_links = [
    link.get_attribute("href") for link
    in browser.find_elements(By.CSS_SELECTOR, css_selector)
]

print(download_links)

This should open a browser window (Edge this time) and return the following:

['https://clutch.co/profile/empist', 'https://clutch.co/profile/sugarshot', 'https://clutch.co/profile/veraqor', 'https://clutch.co/profile/vertical-computers', 'https://clutch.co/profile/andromeda-technology-solutions', 'https://clutch.co/profile/betterworld-technology', 'https://clutch.co/profile/symphony-solutions', 'https://clutch.co/profile/andersen', 'https://clutch.co/profile/blackthorn-vision', 'https://clutch.co/profile/pca-technology-group', 'https://clutch.co/profile/deft', 'https://clutch.co/profile/varsity-technologies', 'https://clutch.co/profile/techprocomp', 'https://clutch.co/profile/vintage-it-services', 'https://clutch.co/profile/imagis', 'https://clutch.co/profile/xiztdevops', 'https://clutch.co/profile/parachute-technology', 'https://clutch.co/profile/blackpoint-it', 'https://clutch.co/profile/exigent-technologies', 'https://clutch.co/profile/xenonstack', 'https://clutch.co/profile/it-outposts', 'https://clutch.co/profile/integris', 'https://clutch.co/profile/techmd', 'https://clutch.co/profile/total-networks', 'https://clutch.co/profile/applied-tech', 'https://clutch.co/profile/alpacked', 'https://clutch.co/profile/bit-bit-computer-consultants', 'https://clutch.co/profile/framework-it', 'https://clutch.co/profile/britenet', 'https://clutch.co/profile/success-computer-consulting', 'https://clutch.co/profile/cyberduo', 'https://clutch.co/profile/bca-it', 'https://clutch.co/profile/britecity', 'https://clutch.co/profile/designdata', 'https://clutch.co/profile/ascendant-technologies-0', 'https://clutch.co/profile/ripple-it', 'https://clutch.co/profile/tpx-communications', 'https://clutch.co/profile/xvand-technology-corp', 'https://clutch.co/profile/sikich', 'https://clutch.co/profile/cloudience', 'https://clutch.co/profile/mis-solutions', 'https://clutch.co/profile/real-it-solutions', 'https://clutch.co/profile/arium', 'https://clutch.co/profile/intetics', 'https://clutch.co/profile/gencare', 'https://clutch.co/profile/innowise-group', 
'https://clutch.co/profile/tech-superpowers-0', 'https://clutch.co/profile/spd-group', 'https://clutch.co/profile/juern-technology', 'https://clutch.co/profile/turrito-networks']

On the other hand, locally you don't even need Selenium if you have cloudscraper.

For example, this:

import cloudscraper
from bs4 import BeautifulSoup

scraper = cloudscraper.create_scraper()
source = scraper.get("https://clutch.co/it-services/msp")

css_selector = ".directory-list div.provider-info--header .company_info a"

links = [
    f'https://clutch.co{anchor["href"]}' for anchor in
    BeautifulSoup(source.text, "html.parser").select(css_selector)
]
print(links)

Should return:

['https://clutch.co/profile/empist', 'https://clutch.co/profile/sugarshot', 'https://clutch.co/profile/veraqor', 'https://clutch.co/profile/vertical-computers', 'https://clutch.co/profile/andromeda-technology-solutions', 'https://clutch.co/profile/betterworld-technology', 'https://clutch.co/profile/symphony-solutions', 'https://clutch.co/profile/andersen', 'https://clutch.co/profile/blackthorn-vision', 'https://clutch.co/profile/pca-technology-group', 'https://clutch.co/profile/deft', 'https://clutch.co/profile/varsity-technologies', 'https://clutch.co/profile/techprocomp', 'https://clutch.co/profile/vintage-it-services', 'https://clutch.co/profile/imagis', 'https://clutch.co/profile/xiztdevops', 'https://clutch.co/profile/parachute-technology', 'https://clutch.co/profile/blackpoint-it', 'https://clutch.co/profile/exigent-technologies', 'https://clutch.co/profile/xenonstack', 'https://clutch.co/profile/it-outposts', 'https://clutch.co/profile/integris', 'https://clutch.co/profile/techmd', 'https://clutch.co/profile/total-networks', 'https://clutch.co/profile/applied-tech', 'https://clutch.co/profile/alpacked', 'https://clutch.co/profile/bit-bit-computer-consultants', 'https://clutch.co/profile/framework-it', 'https://clutch.co/profile/britenet', 'https://clutch.co/profile/success-computer-consulting', 'https://clutch.co/profile/cyberduo', 'https://clutch.co/profile/bca-it', 'https://clutch.co/profile/britecity', 'https://clutch.co/profile/designdata', 'https://clutch.co/profile/ascendant-technologies-0', 'https://clutch.co/profile/ripple-it', 'https://clutch.co/profile/tpx-communications', 'https://clutch.co/profile/xvand-technology-corp', 'https://clutch.co/profile/sikich', 'https://clutch.co/profile/cloudience', 'https://clutch.co/profile/mis-solutions', 'https://clutch.co/profile/real-it-solutions', 'https://clutch.co/profile/arium', 'https://clutch.co/profile/intetics', 'https://clutch.co/profile/gencare', 'https://clutch.co/profile/innowise-group', 
'https://clutch.co/profile/tech-superpowers-0', 'https://clutch.co/profile/spd-group', 'https://clutch.co/profile/juern-technology', 'https://clutch.co/profile/turrito-networks']
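Since the original goal was a CSV file, the harvested links can be written out with the standard library alone. A minimal sketch (the function name and the clutch_links.csv filename are my own choices; pandas' to_csv would work just as well):

```python
import csv

def save_links(links, path="clutch_links.csv"):
    """Write one profile URL per row, with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["profile_url"])
        writer.writerows([link] for link in links)

# Usage with the first two links from the output above:
save_links([
    "https://clutch.co/profile/empist",
    "https://clutch.co/profile/sugarshot",
])
```

Passing the full `links` list from the cloudscraper snippet produces the CSV the question set out to build.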

PS. The source for the Debian magic on colab is here.

  • good day dear baduker - this is a great approach and you prove that one can do allmost impossible things on colab too. Afaik - you show how to do that clutch-scraper on the colab-environment - and your example shows how one can do even such tricky things like this - on the colab-environemt. Many thanks for the headsup and for encouraging us to go ahead. This is a pretty nice learning curve. - outstanding and damned cool ;) – thannen Jun 22 '23 at 17:51
  • update 2: thanks to badduker and his reply with the colab-workaround and results - i have tried to add some more code in order to parse some of the results. see above the update 2 well i probably will try out this on a new thread - what do you say!? – malaga Jun 22 '23 at 20:03
  • hello dear baduker - many many thanks for the answer and your explanation - i am trying to replicate it here - and sure: i am currently installing Anaconda on a Linux-Notebook to work on with all that things - here locally - so that i do not run into any restrictcions of Colab. Many many thanks for all you do - its so awesome!!! – malaga Jun 22 '23 at 20:04
  • @malaga adding more code to `colab` won't get you *anywhere*. In other words, it's not going to work. Cloudflare detects the headless browser and stops you right there. There's no HTML to parse with *any* of the data you see on the website. Hence, no results with `bs4`, `pandas` or anything else. Try running the code *locally*, either in your Anaconda set up or in a virtual environment, preferably with the `cloudscraper` module. – baduker Jun 22 '23 at 20:18
  • The whole thing with running `selenium` on `colab` was to show you that it's a, well, dead-end. There's nowhere to go from there. You've asked for *other* approaches, and here they are. You're more likely to get the desired results by *not* running *any* of the suggested solutions on `colab`. – baduker Jun 22 '23 at 20:24