I want to scrape this website: https://www.sortlist.fr/search

The page shows a list of websites; clicking one opens a page with more details about that website. I want to get the URL of that details page, but I can't find it in the <a href> attribute.

I tried inspecting the element and searching the scripts for it, but couldn't find it. I also tried looking at the Network tab in the dev tools, without success.

Does anyone have an idea?

By the way, I want to use Selenium for this, even though there is no login system. Is that a good idea, or is there a better way?

Ajeet Verma
Hyaku
  • There are 338 elements on the https://www.sortlist.fr/search page with an href attribute. Which ones are you looking for? – undetected Selenium Jun 26 '23 at 08:13
  • @undetectedSelenium It is a list of websites and I want the URL in the href of each of them, so I would say every href that leads to the description of a website. It looks like this: https://www.sortlist.fr/agency/pursuit-digital – Hyaku Jun 26 '23 at 08:23
  • This is a React-based web application. There is a tool you can try named Octoparse web scraper. – aGreenCoder Jun 26 '23 at 08:37
  • @aGreenCoder OK, I will take a look at it, but do you think I can just have the link clicked without needing the complete URL? – Hyaku Jun 26 '23 at 08:42
  • @Hyaku Octoparse web scraper works at the UI level, so yes, you can. – aGreenCoder Jun 26 '23 at 08:48
  • @aGreenCoder OK, I started the scan of the site with it. I will see the result when it's done. – Hyaku Jun 26 '23 at 08:54

1 Answer

The agences trouvées elements on the webpage have an href attribute, but its value is empty:

<a href="" class="h5 bold text-secondary-900 text-truncate mb-8" data-testid="name-cell">Pursuit Digital</a>

So you won't be able to extract the target URLs from the href attributes on the main page directly.
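
As a quick check of that observation, here is a minimal sketch that parses the anchor markup quoted above with Python's standard html.parser and shows that the href value is an empty string. The AnchorCollector class is a hypothetical helper written for this illustration and is not part of Selenium or of the site.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Hypothetical helper: collects (href, text) pairs for every <a> tag."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._in_anchor = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._in_anchor = False


# The anchor markup exactly as quoted in this answer
snippet = (
    '<a href="" class="h5 bold text-secondary-900 text-truncate mb-8" '
    'data-testid="name-cell">Pursuit Digital</a>'
)

collector = AnchorCollector()
collector.feed(snippet)
collector.close()
print(collector.anchors)  # the href is present but empty
```

The same check on a live page would need the rendered DOM (for example driver.page_source), since the list is built by JavaScript.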


Solution

Instead, you can click each agences trouvées link, which opens the agency page in an adjacent tab, switch to that tab and print its current URL. To collect the links, induce WebDriverWait for visibility_of_all_elements_located() using the following locator strategy:

  • Code Block:

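    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome; any WebDriver should work
    driver.get("https://www.sortlist.fr/search")
    parent_window = driver.current_window_handle
    # Wait until every agency name link is visible
    elements = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a[data-testid='name-cell']")))
    hrefs = []
    for elem in elements:
        elem.click()  # opens the agency page in a new tab
        # Wait for the new tab to exist before looking up its handle
        WebDriverWait(driver, 20).until(EC.number_of_windows_to_be(2))
        all_windows = driver.window_handles
        new_window = [window for window in all_windows if window != parent_window][0]
        driver.switch_to.window(new_window)
        print(new_window)
        print(driver.current_url)
        hrefs.append(driver.current_url)
        driver.close()  # close the new tab, then return to the results page
        driver.switch_to.window(parent_window)
    print(hrefs)
    driver.quit()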
    
  • Console Output:

    85F8A3B48F9DF45BEB28D7A530E6979E
    https://www.sortlist.fr/agency/pursuit-digital
    BA4F926FAD46A5EA5F5FC4406861D20D
    https://www.sortlist.fr/agency/rozee-digital
    84E3A361C4202C594893546BEF39CD47
    https://www.sortlist.fr/agency/trends-tokyo
    FC27FFCB9CBE26CD908B8865B8C5CEA5
    https://www.sortlist.fr/agency/cortlex
    64E50C5041A98BECCB17475A80477D60
    https://www.sortlist.fr/agency/steinpilz-gmbh
    36FF3D6D3C803BF05EEBB676D58E2DE7
    https://www.sortlist.fr/agency/everrank-salesdesk24-gmbh
    A13B789C8A618AAD5C372219FC5E3E7E
    https://www.sortlist.fr/agency/cc-systems
    C39AB3659EE6A627044A2A29CC439AFD
    https://www.sortlist.fr/agency/snapp-x
    2979C1A6C0FEF21B3499B2184907F28B
    https://www.sortlist.fr/agency/scrumble
    452F8D30237A146724055715E9690288
    https://www.sortlist.fr/agency/gaofeng-creative
    F05A9B4963C54306ABBB74420481989E
    https://www.sortlist.fr/agency/dashdot
    FE2B66F925ACCA122B86E597D28B5403
    https://www.sortlist.fr/agency/therocketsoft
    FBBE3D1535D35C230A5C7496632435DC
    https://www.sortlist.fr/agency/run-gun-films
    D4C5C162F3C422FB44862563D8AB73DD
    https://www.sortlist.fr/agency/studio-unbound
    329DA752A15041450FF5DDAA7850C332
    https://www.sortlist.fr/agency/contentgo
    B35A03AA6947A1EE043E3EE915E219BE
    https://www.sortlist.fr/agency/tabua-digital-unipessoal-ldaa
    F77913A1097ACD4DB2B78F4E997B4A0E
    https://www.sortlist.fr/agency/yarandin-llc
    7A3C75AFF9ED31E5C5E5915A7E9A84EB
    https://www.sortlist.fr/agency/fortis-media
    C86FCE23AF84B72CFF793A349C005BDD
    https://www.sortlist.fr/agency/osenorth
    A266A09B3AEDD65E8A43E26DEAECBF22
    https://www.sortlist.fr/agency/apps-square
    ['https://www.sortlist.fr/agency/pursuit-digital', 'https://www.sortlist.fr/agency/rozee-digital', 'https://www.sortlist.fr/agency/trends-tokyo', 'https://www.sortlist.fr/agency/cortlex', 'https://www.sortlist.fr/agency/steinpilz-gmbh', 'https://www.sortlist.fr/agency/everrank-salesdesk24-gmbh', 'https://www.sortlist.fr/agency/cc-systems', 'https://www.sortlist.fr/agency/snapp-x', 'https://www.sortlist.fr/agency/scrumble', 'https://www.sortlist.fr/agency/gaofeng-creative', 'https://www.sortlist.fr/agency/dashdot', 'https://www.sortlist.fr/agency/therocketsoft', 'https://www.sortlist.fr/agency/run-gun-films', 'https://www.sortlist.fr/agency/studio-unbound', 'https://www.sortlist.fr/agency/contentgo', 'https://www.sortlist.fr/agency/tabua-digital-unipessoal-ldaa', 'https://www.sortlist.fr/agency/yarandin-llc', 'https://www.sortlist.fr/agency/fortis-media', 'https://www.sortlist.fr/agency/osenorth', 'https://www.sortlist.fr/agency/apps-square']
    
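
If only the agency pages are of interest, the collected list can be filtered afterwards. The following is a minimal sketch using only the standard library; it assumes the /agency/<slug> URL shape seen in the output above. agency_slug is a hypothetical helper name, and the non-agency and duplicate entries in the sample list are made up for illustration.

```python
from urllib.parse import urlparse


def agency_slug(url):
    """Return the slug of an /agency/<slug> URL, or None if the URL does not match."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) == 2 and parts[0] == "agency" and parts[1]:
        return parts[1]
    return None


hrefs = [
    "https://www.sortlist.fr/agency/pursuit-digital",
    "https://www.sortlist.fr/agency/rozee-digital",
    "https://www.sortlist.fr/search",                  # hypothetical non-agency URL
    "https://www.sortlist.fr/agency/pursuit-digital",  # hypothetical duplicate
]

# Keep matching URLs only and drop duplicates while preserving order
slugs = list(dict.fromkeys(s for s in map(agency_slug, hrefs) if s))
print(slugs)
```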
undetected Selenium