I am trying to extract the well-formed URLs from a string that contains both well-formed and malformed URLs, using regular expressions. The result should be a list of the well-formed URLs from the input string. The problem is that I cannot get rid of "https://example{.com": everything I have come up with matches up to the "{" character and returns "https://example" in the results.

The code I am checking is below:

import re
text = "https://example{.com http://example.com http://example.hgg.com/da.php?=id42 http\\:example.com http//: example.com"
print(re.findall('http[s]?[://](?:[a-zA-Z0-9$-_@.&+])+', text))

Is there a good way to get all the matches while excluding any candidate that contains a bad character (like "{")?

martineau
Ulvi92
  • Are all the URLs separated by spaces? – Iain Shelvington Jan 04 '20 at 20:53
  • Yes. The string that I need to extract URLs from contains various versions of URLs separated by spaces. – Ulvi92 Jan 04 '20 at 21:02
  • Does this answer your question? [How do you validate a URL with a regular expression in Python?](https://stackoverflow.com/questions/827557/how-do-you-validate-a-url-with-a-regular-expression-in-python) – Bruno Jan 04 '20 at 21:04
  • _The string that I need to extract URLs from is a string containing various versions of URLs separated by spaces._ In that case, what you should do is split out the potential URLs and then check each one, which means **this is a duplicate**, as pointed out by @Bruno. The only thing I'll add is that you really shouldn't be using regex for any part of this. – AMC Jan 04 '20 at 22:18
  • Another potentially useful question: https://stackoverflow.com/q/22238090/11301900. – AMC Jan 04 '20 at 22:20
  • I tried to use the code given in the examples, but it somehow does not work for me. Thanks anyway. As far as I understand now, the question is not as simple as it sounds, and there are much more efficient ways to solve it (without regex). – Ulvi92 Jan 05 '20 at 10:53

2 Answers

It's difficult to know exactly what you need, but this should help. URLs are hard to parse with regular expressions, and Python already ships with a URL parser. Since your candidates appear to be space-separated, you could do something like this:

from urllib.parse import urlparse


text = "https://example{.com http://example.com http://example.hgg.com/da.php?=id42 http\\:example.com http//: example.com"

for token in text.split():
    result = urlparse(token)
    if result.scheme in {'http', 'https'} \
            and result.netloc \
            and all(c == '.' or c.isalpha() for c in result.netloc):
        print(token)

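Split the text into a list of strings with text.split(), then try to parse each item with urlparse(token). Print the token if the scheme is http or https, the domain (a.k.a. netloc) is non-empty, and every character in it is a letter or a dot. Note that this check also rejects hosts that contain digits, hyphens, or a port.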
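The same idea can be wrapped in a reusable function. This is only a sketch: `extract_urls` and `HOST_RE` are hypothetical names, and the host pattern is an assumption (dot-separated labels made of ASCII letters, digits and hyphens), not a full hostname validator.

```python
import re
from urllib.parse import urlparse

# Assumption: a host is one or more dot-separated labels made of
# ASCII letters, digits and hyphens. Real hostname rules are stricter.
HOST_RE = re.compile(r'[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*')


def extract_urls(text):
    """Return the whitespace-separated tokens that parse as http(s) URLs."""
    urls = []
    for token in text.split():
        try:
            parts = urlparse(token)
        except ValueError:
            # urlparse raises ValueError for some malformed inputs,
            # for example an unbalanced IPv6 bracket.
            continue
        host = parts.hostname  # netloc without userinfo or port
        if parts.scheme in {'http', 'https'} and host and HOST_RE.fullmatch(host):
            urls.append(token)
    return urls


text = "https://example{.com http://example.com http://example.hgg.com/da.php?=id42 http\\:example.com http//: example.com"
print(extract_urls(text))
```

Using `parts.hostname` rather than `netloc` means a token such as `http://example.com:8080` is judged on its host alone, which is a deliberate loosening compared with the loop above.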

Peter Sutton

In your example, each URL is followed by whitespace (or by the end of the string), so you can use a lookahead to require exactly that: (?=\s|$).

Your regex can be fixed as follows:

print(re.findall(r'http[s]?[:/](?:[a-zA-Z0-9$-_@.&+])+(?=\s|$)', text))

Note: don't forget to use a raw string (prefixed by an "r").
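A quick check of the lookahead fix against the sample text is shown below. It is only a sketch, and the names `loose` and `strict` are hypothetical. One thing it illustrates: `[:/]` is a character class that matches a single `:` or `/`, so the pattern still lets the token `http//:` through, while spelling out the literal `://` rejects it.

```python
import re

text = "https://example{.com http://example.com http://example.hgg.com/da.php?=id42 http\\:example.com http//: example.com"

# [:/] matches ONE character, either ':' or '/', so 'http//:' still qualifies.
loose = re.findall(r'http[s]?[:/](?:[a-zA-Z0-9$-_@.&+])+(?=\s|$)', text)

# Requiring the literal '://' after the scheme rejects 'http//:'.
strict = re.findall(r'https?://[a-zA-Z0-9$-_@.&+]+(?=\s|$)', text)

print(loose)   # ['http://example.com', 'http://example.hgg.com/da.php?=id42', 'http//:']
print(strict)  # ['http://example.com', 'http://example.hgg.com/da.php?=id42']
```

In both patterns, `$-_` inside the brackets is a range from `$` to `_`, which is why characters such as `/`, `:`, `?` and `=` are accepted even though they are not listed explicitly.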

You can also improve your RegEx, for instance:

import re

text = "https://example{.com http://example.com http://example.hgg.com/da.php?=id42 http\\:example.com http//: example.com"

URL_REGEX = r"(?:https://|http://|ftp://|file://|mailto:)[-\w+&@#/%=~_|?!:,.;]+[-\w+&@#/%=~_|](?=\s|$)"

print(re.findall(URL_REGEX, text))

You get:

['http://example.com', 'http://example.hgg.com/da.php?=id42']

For a more robust regex, take a look at this question: “What is the best regular expression to check if a string is a valid URL?”

One answer there gives this regex for Python:

URL_REGEX = re.compile(
    r'(?:http|ftp)s?://'  # http://, https://, ftp:// or ftps://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|'  # ...or ipv4
    r'\[?[A-F0-9]*:[A-F0-9:]+\]?)'  # ...or ipv6
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)', re.IGNORECASE)

It works like a charm!
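One caveat when applying this pattern to the sample text: as written it has no anchors, so `re.findall` accepts the first alternative of the final group (`/?`) and cuts the path off. A minimal sketch of one way around that, assuming the candidates are whitespace-separated as in the question, is to split the text and require each token to match in full. The names `found` and `valid` are hypothetical.

```python
import re

URL_REGEX = re.compile(
    r'(?:http|ftp)s?://'  # scheme
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|'  # ...or ipv4
    r'\[?[A-F0-9]*:[A-F0-9:]+\]?)'  # ...or ipv6
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)', re.IGNORECASE)

text = "https://example{.com http://example.com http://example.hgg.com/da.php?=id42 http\\:example.com http//: example.com"

# Unanchored search: the trailing group is satisfied by '/?' alone,
# so the path of the second URL is truncated.
found = URL_REGEX.findall(text)

# Per-token validation: fullmatch forces the pattern to consume the whole token.
valid = [token for token in text.split() if URL_REGEX.fullmatch(token)]

print(found)  # ['http://example.com', 'http://example.hgg.com/']
print(valid)  # ['http://example.com', 'http://example.hgg.com/da.php?=id42']
```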

Laurent LAPORTE