I am writing a program that needs to identify different link formats in the text stored in a dictionary. The links could look like any of these: https://www.examplelink.com, www.examplelink.com, examplelink.com

Is there a way to identify these link types with a pattern and extract the entire URL from the text? My code so far finds the first link form (the one starting with https://) but none of the others:

import re

# total and parsed_text_dictionary are defined elsewhere in the program
dictionary_itemnumber = 0

pattern1 = "(?P<url>https?://[^\s]+\.(com|net|ru|org|ir|in|uk|au|ua|de|ch))"
for i in range(total):
    # search once and reuse the result instead of running the regex twice
    url = re.search(pattern1, parsed_text_dictionary["parsed text" + str(dictionary_itemnumber)])
    if url:
        print("link found")
        print(url)
    else:
        print("no link found")
    dictionary_itemnumber = dictionary_itemnumber + 1

The output of this code is:

link found
<re.Match object; span=(132, 168), match='https://www.laufenburg-tourismus.com'>
no link found
no link found

1 Answer

In https?, the ? makes only the 's' optional, so your pattern matches only URLs that begin with 'http://' or 'https://'. What you are missing is a group that makes the whole scheme optional: (https?://)?.

Use the following:

r"(?P<url>(https?://)?[^\s]+\.(com|net|ru|org|ir|in|uk|au|ua|de|ch))"
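As a rough sketch of how this pattern behaves, the snippet below runs it on a few made-up sentences, one per link form from the question. The sample strings and the find_url helper are assumptions for illustration, not part of the original program.

```python
import re

# Pattern from the answer: the scheme group (https?://)? is optional.
URL_PATTERN = re.compile(
    r"(?P<url>(https?://)?[^\s]+\.(com|net|ru|org|ir|in|uk|au|ua|de|ch))"
)

# Hypothetical sample texts, one per link form from the question.
samples = [
    "Visit https://www.examplelink.com for details",
    "Visit www.examplelink.com for details",
    "Visit examplelink.com for details",
    "No link in this text",
]


def find_url(text):
    """Return the first URL-like substring, or None if nothing matches."""
    match = URL_PATTERN.search(text)
    return match.group("url") if match else None


for text in samples:
    print(find_url(text))
```

Note that [^\s]+ accepts any run of non-whitespace characters, so this pattern is permissive: it only checks that the run contains a dot followed by one of the listed endings.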

Also, in Python, you should always prefix a regex string with r to make it a raw string. That way you avoid having to escape the backslashes. See this thread.
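To illustrate why the r prefix matters, here is a small sketch using \b, which means "backspace character" in a normal Python string but "word boundary" in a regex. The sample text is an assumption chosen only to show the difference.

```python
import re

text = "a foo b"

# Without the r prefix, Python turns "\b" into a single backspace
# character before the regex engine ever sees it.
plain = re.search("\bfoo", text)

# With the r prefix, the backslash reaches the regex engine intact,
# where \b means "word boundary".
raw = re.search(r"\bfoo", text)

print(plain)  # no match: the text contains no backspace character
print(raw)    # matches "foo"
```

Sequences such as \s are not recognized string escapes, so they happen to survive in a normal string, but recent Python versions emit a warning for them, which is another reason to use raw strings.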

Note:

  • Your list of top-level domains is limited. There are many more top-level domains out there. You can match them with a more general (\.[a-z]+)+ at the end.

  • Remember that domains can also take the form of example.com.br.

  • Also, you can always test your regex online on a website like regex101.
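As a sketch of the more general ending mentioned in the notes above, the snippet below extracts every URL-like substring from a text, including multi-part endings such as example.com.br. The pattern is one possible variant and an assumption on my part, not a complete URL matcher: it limits host labels to letters, digits, and hyphens and accepts any alphabetic ending of two or more letters, so it will also match things like file names (report.txt). The extract_urls helper and the sample text are hypothetical.

```python
import re

# Assumed, simplified pattern: optional scheme, one or more "label." parts,
# then an alphabetic ending of at least two letters.
GENERAL_URL = re.compile(
    r"(?P<url>(?:https?://)?(?:[a-z0-9-]+\.)+[a-z]{2,})",
    re.IGNORECASE,
)


def extract_urls(text):
    """Return all URL-like substrings found in text, in order."""
    return [m.group("url") for m in GENERAL_URL.finditer(text)]


sample = "a https://www.examplelink.com b www.examplelink.com c example.com.br"
for url in extract_urls(sample):
    print(url)
```

Because any two-letter-or-longer ending is accepted, you may still want to check the result against a list of known top-level domains if false positives matter.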

Read more at Python's regex documentation.

See this other thread to better understand how to match urls with regex.

Maicon Mauricio