
The following error is raised when running the script below:

requests.exceptions.MissingSchema: Invalid URL 'None': No schema supplied. Perhaps you meant http://None?

I saw one solution that said to find the elements by XPath, but as I said, I am new and have not been able to replicate the code.

import requests
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

option = webdriver.ChromeOptions()
option.add_argument("headless")

driver = webdriver.Chrome(options=option)
driver.get("https://charities.govt.nz/")
links = driver.find_elements_by_css_selector("a")
print("Number of links : %s" %len(links))

for link in links:
    r = requests.head(link.get_attribute('href'))
    print(link.get_attribute('href'), r.status_code)

A pointer in the right direction would be appreciated.

G.Dhar

1 Answer


The main issues were caused by problematic links on that website: some anchors have no href attribute (so get_attribute('href') returns None, which is what produced your MissingSchema error) and one href is an empty string. I've modified your code to handle this with an if-statement inside the for-loop that checks the link starts with http (this also skips the mailto: link on that page).

I've also changed how the links are retrieved and stored. They are now retrieved with an XPath expression that only matches anchors that actually have an href, and they are stored as a list of unique strings rather than a list of WebElements, which makes them easier to work with and removes duplicate links.

import requests
from selenium import webdriver

option = webdriver.ChromeOptions()
option.add_argument("headless")

driver = webdriver.Chrome(options=option)
driver.get("https://charities.govt.nz/")

# Collect the href of every anchor that actually has an href attribute
all_links = [link.get_attribute("href") for link in driver.find_elements_by_xpath("//a[@href]")]

# This variable contains all unique links (i.e. removes duplicates)
unique_links = list(dict.fromkeys(all_links))

print("Number of links : %s" %len(all_links))
print("Number of unique links : %s" %len(unique_links))

for link in unique_links:
    # If the link is easy to work with
    if link.startswith("http"):
        req = requests.head(link)
        print(link, req.status_code)
    else:
        print("Ignoring '{}'".format(link))

Roqux
  • Someone marked my question as a duplicate, but the solutions they provided don't work in my scenario, so thank you @Tyler for replying. While running your statement I am getting "AttributeError: 'FirefoxWebElement' object has no attribute 'startswith'". Can you please suggest a fix? I am very new to programming and it takes me hours just to resolve one error. – G.Dhar Oct 02 '19 at 01:29
  • Sorry, I have edited my answer. It was laziness on my part as I was using `link` as a `string` datatype when it is actually a `FirefoxWebElement`. The above code now stores the `href` in a string via `link_url = link.get_attribute('href')` allowing you to use and manipulate `link_url`. – Roqux Oct 02 '19 at 02:06
  • It still comes up with the same error: File "C:\Users\dharg\PycharmProjects\BrokenLinks\venv\lib\site-packages\requests\models.py", line 387, in prepare_url raise MissingSchema(error) requests.exceptions.MissingSchema: Invalid URL '': No schema supplied. Perhaps you meant http://? How can I share the whole result with you? – G.Dhar Oct 02 '19 at 02:30
  • Someone mentioned this in a different post: "When finding the element by TAG_NAME it shows me the same error, but for XPATH it works." The solution mentioned was links = WebDriverWait(driver, 10).until(EC.visibility_of_any_elements_located((By.XPATH, "//div[@class='rc']//h3//ancestor::a[1]"))), but how do I use it in my case? Again, thank you Tyler for your help. – G.Dhar Oct 02 '19 at 02:38
  • @G.Dhar, I've once again updated the answer. The real issue was that there was a link on the page with an empty-string href. I've used a simple `if-statement` to ignore this and other "hard-to-deal-with" links so you can keep moving forward with this project. I have also changed the way links are obtained to use `xpath` so you can see how it works for your scenario. – Roqux Oct 02 '19 at 04:01
  • Thank you @Tyler for your help. – G.Dhar Oct 02 '19 at 22:21