The problem: check whether an entered link is valid. The link may be entered either as https://stackoverflow.com/ or as stackoverflow.com.

I tried to solve it as

input_url = str(input("Enter url: "))
result = re.findall(r'(http[s]?://)?\S+', input_url)

but it fails with the error: Invalid URL '': No schema supplied. Perhaps you meant http://?

I can't use urllib or anything else; it has to be regex only.

Full code:

import re, requests
from collections import Counter
from prettytable import PrettyTable

url_input = str(input("Enter url: "))

url_checked = re.findall(r'(http[s]?://)?\S+', url_input)[0]  # take the first element

response = requests.get(str(url_checked))  # request the entered link

result = re.findall(r"\"(?:http[s]?://)?([^:/\s\"]+)/?[^\"]*\"", response.text)  # filter the links

result.sort()  # sort alphabetically

# link - https://stackoverflow.com/

pt = PrettyTable(field_names = ["word", "counter"])
pt.add_rows(list(Counter(result).most_common()))
print(pt)
sophros

1 Answer

Your regular expression seems far too simple to validate a URL robustly. I suggest you use the one from here.
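Separately from validation strength, the error in the question can be traced to how re.findall treats capture groups: when the pattern has exactly one group, it returns that group's text rather than the whole match, and an optional group that did not take part comes back as an empty string. The sketch below only illustrates that behaviour; the helper names are hypothetical, and the non-capturing variant is a suggestion, not the linked regex.

```python
import re

# Pattern from the question: an optional, *capturing* scheme group.
QUESTION_PATTERN = r'(http[s]?://)?\S+'


def first_findall_result(text):
    """Reproduce the question's call.

    With exactly one capture group, re.findall returns the group's text
    for each match, not the whole matched string.
    """
    return re.findall(QUESTION_PATTERN, text)[0]


def whole_match(text):
    """Return the whole matched string, or None if the input does not match.

    Uses a non-capturing group (?:...) and re.fullmatch, so the result is
    the complete input rather than just the scheme.
    """
    match = re.fullmatch(r'(?:https?://)?\S+', text)
    return match.group(0) if match else None


print(repr(first_findall_result("stackoverflow.com")))           # ''
print(repr(first_findall_result("https://stackoverflow.com/")))  # 'https://'
print(repr(whole_match("stackoverflow.com")))                    # 'stackoverflow.com'
```

So for stackoverflow.com the question's code passes an empty string to requests.get, which is where "Invalid URL ''" comes from.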

sophros
  • With my regex, links such as google.com pass, but the regex in your link gives me an error for them. –  Jan 09 '21 at 09:18
  • 'google.com' is not a valid URL. It requires a prefix (e.g. 'http://'). I would just check whether either `input_url` or the string `'http://' + input_url` matches the linked regex. This would satisfy your requirement. – sophros Jan 09 '21 at 11:56
  • This code is rather old. I now have a variant that adds https or http to the link when the protocol is missing, and then requests accepts it. But with this, google.com will still be matched as invalid. –  Jan 09 '21 at 11:58
  • The code might be old, but that does not mean it is invalid, right? You can experiment with optionally adding 'www', or modify the linked regex yourself. – sophros Jan 11 '21 at 11:16
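The check suggested in the comments (accept the input if it matches as entered, or if it matches once 'http://' is prepended) could be sketched roughly as follows, using only re. Note that URL_RE is a hypothetical, deliberately simple stand-in and not the linked regex, and normalize_url and default_scheme are illustrative names.

```python
import re

# Hypothetical, deliberately simple pattern (an assumption, NOT the linked regex):
# scheme, dot-separated host labels, alphabetic TLD, optional port, optional path.
URL_RE = re.compile(
    r'https?://'
    r'(?:[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?\.)+'
    r'[A-Za-z]{2,63}'
    r'(?::\d{1,5})?'
    r'(?:[/?#]\S*)?',
    re.IGNORECASE,
)


def normalize_url(user_input, default_scheme="http://"):
    """Return a URL that includes a scheme if the input validates, else None.

    The input is tried as entered first; if that fails, it is tried again
    with default_scheme prepended.
    """
    candidate = user_input.strip()
    if URL_RE.fullmatch(candidate):
        return candidate
    prefixed = default_scheme + candidate
    if URL_RE.fullmatch(prefixed):
        return prefixed
    return None


print(normalize_url("stackoverflow.com"))
print(normalize_url("https://stackoverflow.com/"))
print(normalize_url("not a url"))
```

A non-None result always carries a scheme, so it can be handed to requests.get without triggering the "No schema supplied" error. The pattern here will reject some legitimate URLs (IP addresses, localhost, internationalized hosts), so a stricter or broader regex can be swapped in without changing the function.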