I have multiple links...

linkslist = [
    "https://test.com",
    "https://test1.example.com/exm/1/2/3/4",
    "https://test2.example.test.com/exm/1/2/3/4",
    "http://test3.com",
]

From this, I just need to extract the following:

https://test.com
https://test1.com
https://test2.com
http://test3.com

I have tried the following:

import re

if re.search("http*.com", string1):
    print("found")
user1050619

2 Answers

UPDATE: Fixed thanks to @Robin. It worked, but it was a little bit off from what I intended.

Assuming only http or https (and no ports), this works:

(https?://(?:\w+\.)+com)(?:/.*)?

The URL is in capture group one.
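As a minimal Python sketch of how that capture group might be used on the links from the question (the names `URL_RE` and `base_url` are illustrative and not part of the original answer):

```python
import re

# Pattern from the answer: scheme, one or more "label." parts, then "com",
# optionally followed by a path that is matched but not captured.
URL_RE = re.compile(r"(https?://(?:\w+\.)+com)(?:/.*)?")

def base_url(link):
    """Return capture group one (scheme + host), or None if nothing matches."""
    m = URL_RE.search(link)
    return m.group(1) if m else None

links = [
    "https://test.com",
    "https://test1.example.com/exm/1/2/3/4",
    "https://test2.example.test.com/exm/1/2/3/4",
    "http://test3.com",
]

for link in links:
    print(base_url(link))
```

Note that group one keeps the whole host name, so the second link yields https://test1.example.com, subdomain labels included.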

Explanation of (?:\w+\.)+:

  • One-or-more of
    • one-or-more word-character: letter, digit, or underscore
    • followed by a literal dot.

For example, this portion matches usatoday. and entertainment.usatoday. (including the trailing dot), that is, every part of the host name that precedes the top-level domain (com).
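To see that sub-pattern in isolation, here is a small Python check. The name `LABELS_RE` and the sample strings are only for illustration:

```python
import re

# One or more repetitions of: word characters followed by a literal dot.
LABELS_RE = re.compile(r"(?:\w+\.)+")

samples = ["usatoday.", "entertainment.usatoday.", "usatoday", "my-site."]

for text in samples:
    # fullmatch requires the whole string to fit the sub-pattern.
    print(text, bool(LABELS_RE.fullmatch(text)))
```

The last two samples fail: the first has no trailing dot, and the second contains a hyphen, which `\w` does not cover.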

To be safe you could also add start- and end-of-line anchors:

^(https?://(?:\w+\.)+com)(?:/.*)?$

To allow other top-level domains, add them as alternatives like this:

^(https?://(?:\w+\.)+(?:com|net|org|gov))(?:/.*)?$
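A short Python sketch of the anchored version, assuming each link is tested as a whole string (`ANCHORED_RE` and `base_url` are hypothetical names):

```python
import re

# Anchored pattern with a small set of allowed top-level domains.
ANCHORED_RE = re.compile(
    r"^(https?://(?:\w+\.)+(?:com|net|org|gov))(?:/.*)?$"
)

def base_url(link):
    """Return scheme + host if the whole string is an accepted URL, else None."""
    m = ANCHORED_RE.match(link)
    return m.group(1) if m else None

for link in [
    "https://test1.example.org/exm/1/2",
    "http://foo.community.com",
    "https://test.community",      # "com" is only a prefix of the last label
    "see https://test.com",        # leading text breaks the ^ anchor
]:
    print(link, "->", base_url(link))
```

With the anchors in place, a host whose last label merely starts with "com" is rejected instead of being cut short.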

Note that this question, and its duplicate, will also be of help: regular expression for url

aliteralmind
  • Huh. It's wrong. That's what it is. I'll fix it, and also explain it in an update. Give me a minute. Glad you said something. – aliteralmind Mar 25 '14 at 23:20
  • @user1050619: My answer had a minor problem, which has been fixed. – aliteralmind Mar 25 '14 at 23:25
  • **1**. Since you throw it out anyway, you can in my opinion also remove the optional `(?:/.*)?` It adds something only in the "anchored" solutions, and doesn't check for much ("anything is OK if there is `/` before")... You could then remove the capturing group and directly use the match. **2**. As your link showed, character validation in URL is a lengthy subject. But `\w`, without even a `-`, seems harsh to me. The question here is whether you need to *match* URLs or to *validate* them. If matching it is, I think there can be more general solutions (using `[^.]++` for example) – Robin Mar 25 '14 at 23:39
  • Actually the `(?:/.*)?` avoids errors on `http://foo.community.com`, so never mind that part :/ – Robin Mar 25 '14 at 23:46
  • You could certainly eliminate the `.*`, and you *could* eliminate the `/`, but I'd choose to leave that in as a boundary (alternatively, you could have an actual word boundary: `\b`). This is a limited answer to a limited problem, and it should work well for him. This is a deep topic, as I refer to in the link at the bottom of my answer. I'd like to see your solution. – aliteralmind Mar 25 '14 at 23:46
  • Forgot to address this part: It could easily allow dashes by changing "\w+" to "[-\w]+" – aliteralmind Mar 26 '14 at 01:49

If you don't want to be specific about the .com part, you could use this. It matches URLs starting with http or https, and it stops at the first forward slash, at the first whitespace character, or at the end of the string, so any port number that is present is included.

/https?:\/\/[^\/\s]+/i

These are the results:

https://test.com -> https://test.com
https://test1.example.com/exm/1/2/3/4 -> https://test1.example.com
https://test2.example.test.com/exm/1/2/3/4 -> https://test2.example.test.com
http://test3.com -> http://test3.com
https://test.com:8080 -> https://test.com:8080
https://test1.example.com:3000/exm/1/2/3/4 -> https://test1.example.com:3000
https://test2.example.test.com:80/exm/1/2/3/4 -> https://test2.example.test.com:80
http://test3.com:8000 -> http://test3.com:8000
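These results could be reproduced in Python roughly as follows. This is a sketch: the `/.../i` literal is translated to `re.IGNORECASE`, forward slashes need no escaping in Python, and `HOST_RE` and `samples` are illustrative names.

```python
import re

# Match the scheme, then everything up to the first slash or whitespace.
# Running out of input ends the match too, so no end-of-string handling is needed.
HOST_RE = re.compile(r"https?://[^/\s]+", re.IGNORECASE)

samples = [
    "https://test.com",
    "https://test1.example.com/exm/1/2/3/4",
    "https://test2.example.test.com/exm/1/2/3/4",
    "http://test3.com",
    "https://test.com:8080",
    "https://test1.example.com:3000/exm/1/2/3/4",
    "https://test2.example.test.com:80/exm/1/2/3/4",
    "http://test3.com:8000",
]

for s in samples:
    m = HOST_RE.search(s)
    print(s, "->", m.group(0) if m else None)
```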

If you want to exclude port numbers, just add a colon to the negated character class:

/https?:\/\/[^\/\s:]+/i

If you do want to be specific about the .com part, append it to the pattern:

https?:\/\/[^\/\s]+\.com

If you want only .com domains, but would like to include port numbers, this is the way to go:

https?:\/\/[^\/\s]+\.com(:\d+)?
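To compare the variants side by side, here is a hedged Python sketch. The dictionary keys, `VARIANTS`, and `extract` are made-up names, and the port group is written as non-capturing because only the whole match is used:

```python
import re

# The three variants described above, translated to Python syntax.
VARIANTS = {
    "no port": re.compile(r"https?://[^/\s:]+", re.IGNORECASE),
    ".com only": re.compile(r"https?://[^/\s]+\.com"),
    ".com with optional port": re.compile(r"https?://[^/\s]+\.com(?::\d+)?"),
}

def extract(name, link):
    """Return the text matched by the named variant, or None."""
    m = VARIANTS[name].search(link)
    return m.group(0) if m else None

link = "https://test1.example.com:3000/exm/1/2/3/4"
for name in VARIANTS:
    print(name, "->", extract(name, link))
```

For the sample link, the first two variants stop before the port and the third keeps it.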
nordhagen
  • +1. It's a good point, to go up to the first slash, regardless of what precedes it, although to be safe, I'd enforce at least some level of specific characters (such as `[\w.]+`) instead of just "not slash". – aliteralmind Mar 25 '14 at 22:59
  • Depends on the use case I guess. Testing against legal characters would require something more than a RegExp due to different top-level domains allowing different localized character sets, i.e. Norwegian .no domains allow æ, ø and å, among other characters. Depending on the RegExp engine used, a simple \w (word character) matcher might not suffice. In JS \w doesn't match these characters. – nordhagen Mar 25 '14 at 23:07
  • It's a big topic, right? I linked to more comprehensive answers in my own answer. – aliteralmind Mar 25 '14 at 23:13