
I have this string:

http://www.amazon.com/books-used-books-textbooks/b%3Fie%3DUTF8%26node%3D283155
http://www.amazon.com/gp/site-directory
http://www.amazon.com/gp/goldbox
https://en.wikipedia.org/wiki/A
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:GLRqJLrDZEQJ:https://en.wikipedia.org/wiki/A%252Ba%26gbv%3D1%26%26ct%3Dclnk
https://twitter.com/a%3Flang%3Den
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:4teZIJ7lbgsJ:https://twitter.com/a%3Flang%253Den%252Ba%26gbv%3D1%26%26ct%3Dclnk
http://dictionary.reference.com/browse/a
http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:Pn8j0e0faiAJ:http://dictionary.reference.com/browse/a%252Ba%26gbv%3D1%26%26ct%3Dclnk
http://boards.4chan.org/a/

I need to grab everything up to where the ".com", ".org", or ".net" ends.

The expected output should look like this:

http://www.amazon.com/
https://en.wikipedia.org/
http://dictionary.reference.com/
http://webcache.googleusercontent.com/
http://boards.4chan.org/

So far I've tried a few things:

  • /(\/)([^\/]+)\Z/
  • ^(http[s]?)(...)\w{3}\
  • /https?:\/\/[\S]/

None of them worked, so now I'm here. If there's an easier way to do it please let me know. I also need to reject the duplicates if there are any.

Cœur
13aal
    Your original question included de-duping. You may be aware (since you've edited your question) that regex won't de-dupe for you. Since you've tagged your question with Ruby I've included the `uniq` in my answer that you'd need in order to de-dupe. – dsample Mar 05 '16 at 23:40

5 Answers


Using the URI module (s is your string):

require 'uri'

s.split(/\n/).map { |line|
    uri = URI(line)
    uri.scheme + "://" + uri.host
}.uniq

Note: if your string comes from a file, you don't need `split`, but you do need to strip the trailing newline from each line before parsing (a trailing newline makes `URI` reject the line):

File.readlines('yourfile').map { |line|
    uri = URI(line.chomp)
    uri.scheme + "://" + uri.host
}.uniq
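For example, with a small subset of the URLs from the question (the sample string here is an assumption for illustration), duplicate hosts collapse via `uniq`:

```ruby
require 'uri'

# Two amazon.com URLs and one wikipedia.org URL; amazon should appear once.
s = "http://www.amazon.com/gp/goldbox\n" \
    "https://en.wikipedia.org/wiki/A\n" \
    "http://www.amazon.com/gp/site-directory"

result = s.split(/\n/).map { |line|
  uri = URI(line)
  uri.scheme + "://" + uri.host
}.uniq
# => ["http://www.amazon.com", "https://en.wikipedia.org"]
```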
Casimir et Hippolyte

Don't reinvent wheels, reuse existing ones:

require 'uri'

%w[
  http://www.amazon.com/books-used-books-textbooks/b%3Fie%3DUTF8%26node%3D283155
  http://www.amazon.com/gp/site-directory
  http://www.amazon.com/gp/goldbox
  https://en.wikipedia.org/wiki/A
  http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:GLRqJLrDZEQJ:https://en.wikipedia.org/wiki/A%252Ba%26gbv%3D1%26%26ct%3Dclnk
  https://twitter.com/a%3Flang%3Den
  http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:4teZIJ7lbgsJ:https://twitter.com/a%3Flang%253Den%252Ba%26gbv%3D1%26%26ct%3Dclnk
  http://dictionary.reference.com/browse/a
  http://webcache.googleusercontent.com/search%3Fhl%3Den%26biw%26bih%26q%3Dcache:Pn8j0e0faiAJ:http://dictionary.reference.com/browse/a%252Ba%26gbv%3D1%26%26ct%3Dclnk
  http://boards.4chan.org/a/
].map{ |s|
  scheme, _, host = URI.split(s)
  "#{ scheme }://#{ host }"
}.uniq
# => ["http://www.amazon.com", "https://en.wikipedia.org", "http://webcache.googleusercontent.com", "https://twitter.com", "http://dictionary.reference.com", "http://boards.4chan.org"]

If your data is in a string, then split it into lines and iterate over them:

str = "foo
bar
baz"

str.lines.map(&:rstrip)
# => ["foo", "bar", "baz"]
the Tin Man

/^(http[s]?:\/\/[^\/]*)\// will do the trick
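A minimal sketch of how that pattern could be applied in Ruby (the sample string is an assumption; `scan` wraps each capture group in an array, hence the `flatten`). Note the pattern requires a slash after the host, so a bare URL with no path would not match:

```ruby
text = "http://www.amazon.com/gp/goldbox\n" \
       "https://en.wikipedia.org/wiki/A\n" \
       "http://www.amazon.com/gp/site-directory"

# Each match captures the scheme and host up to (but not including) the next slash.
hosts = text.scan(/^(http[s]?:\/\/[^\/]*)\//).flatten.uniq
# => ["http://www.amazon.com", "https://en.wikipedia.org"]
```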

sammygadd

A regex I use for capturing the different parts of URLs is:

^(?<uri_schema_and_host>(?<uri_scheme>https|http):\/\/(?<uri_host>[^\/]+))(?<uri_path>\/[^?]*?)?(?<uri_query>\?.*)?$

This creates named captures for many parts of the URL. We can shorten this a bit for your needs to:

^((https|http):\/\/[^\/]+).*$

In Ruby you can easily apply this with `String#scan`, then use `uniq` to de-dupe the results:

regex = /^(?<uri_schema_and_host>(?<uri_scheme>https|http):\/\/(?<uri_host>[^\/]+))(?<uri_path>\/[^?]*?)?(?<uri_query>\?.*)?$/m

results = text.scan regex

scheme_and_hosts = results.map {|x| x[0].to_s }
scheme_and_hosts.uniq!

scheme_and_hosts.each {|x| puts x }

Note: in Ruby, `^` and `$` already anchor to the start and end of each line, so `scan` finds a match on every line regardless; the `/m` at the end of the regex only makes `.` match newlines as well.
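As a quick check of the shortened pattern (the sample input is an assumption for illustration):

```ruby
text = "https://twitter.com/a\nhttp://boards.4chan.org/a/"

re = /^((https|http):\/\/[^\/]+).*$/
# scan returns [scheme_and_host, scheme] pairs; keep the first element of each.
scheme_and_hosts = text.scan(re).map(&:first).uniq
# => ["https://twitter.com", "http://boards.4chan.org"]
```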

dsample
  • This is a great answer except it keeps telling me that I have unmatched close parenthesis? – 13aal Mar 05 '16 at 23:47
  • Hmmm, it works for me: https://repl.it/BtUK/0 If you're having problems you might need to post a bit more of your code so we can assist. – dsample Mar 06 '16 at 00:05
  • Your regex won't ignore duplicates, the question specifically asked to do it by regex and not Ruby code. Although, since your answer got accepted I guess OP was alright with that. – Gediminas Masaitis Mar 06 '16 at 00:30

The fact that you need to avoid duplicates makes it a bit complicated:

/(?:^|\n)(https?:\/\/[^\/]*?\.(?:com|org|net)\/?)(?!(?:.|\n)*\n\1)/

First, (?:^|\n) checks whether or not we are at the beginning of a line, as we don't want to match anything in the middle of one. Then we begin capturing our group with (. We match http, and s if it exists, followed by a colon and two escaped slashes :\/\/. We then capture everything except a slash, with lazy behavior - capturing as little as we can. Here we could capture any character; however, a slash is a good indication that we've gone too far, so we don't want it. Then we capture an escaped dot \. followed by a non-capturing group, which allows us to have either com, org, or net: (?:com|org|net). Finally, if there is a trailing slash, \/? captures it too, and the capturing group closes with ).

This is where it gets interesting. While we've successfully captured our links, we want to avoid any duplicates. For this, we employ a negative look-ahead. We assert that we don't want to find:

  • Any characters, even new line feeds (?:.|\n), taking as many as we can, followed by:
  • A new line \n, followed by
  • The whole capture group that we just captured.

The last bit is very important - this is how we ensure we don't get any duplicates. If we just matched e.g. amazon.com, and amazon.com exists anywhere ahead, it won't be captured. As such, only the last instance of amazon.com will be captured.

A graphical visualization (a railroad diagram of the regex) may help in understanding it even better.
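A minimal sketch of the pattern in action (the input string is an assumption for illustration; because the look-ahead discards earlier occurrences, each duplicate host survives only at its last position):

```ruby
text = "http://a.com/x\nhttp://b.org/y\nhttp://a.com/z"

re = /(?:^|\n)(https?:\/\/[^\/]*?\.(?:com|org|net)\/?)(?!(?:.|\n)*\n\1)/
hosts = text.scan(re).flatten
# The first http://a.com/ is skipped because another occurrence appears later:
# => ["http://b.org/", "http://a.com/"]
```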

Gediminas Masaitis
  • That's pretty smart. I hadn't thought of a look-ahead having the ability to use the capture. Seems a little less readable, but technically correct. I opted to answer with code as I saw 13aal was using Ruby and it is easy to do the processing on the results. – dsample Mar 06 '16 at 00:51
  • 1
    @dsample And it's probably much faster - this Regex took 26109 steps to go through OP's original test data. I only did this example because OP asked for no duplicates using regex explicitly, and because most people probably don't know this trick, and I though it would be useful for readers to see. In production code, it wouldn't be *that smart* to use this kind of solution. – Gediminas Masaitis Mar 06 '16 at 00:58
  • @GediminasMasaitis I really like this answer, however, it tells me that -g is an unknown regex option – 13aal Mar 06 '16 at 21:26
  • @13aal Ah, sorry, I didn't know Ruby doesn't have the global search modifier! I edited it out. You can use `.scan` to achieve the same functionality, see more on this [here](http://stackoverflow.com/questions/3588931/ruby-global-match-regexp) – Gediminas Masaitis Mar 07 '16 at 13:04