
I'm trying to come up with a regex that will elegantly match everything in a URL after the domain name and before the first ?, the last slash, or the end of the URL if neither of the two exists.

This is what I came up with but it seems to be failing in some cases:

regex = /[http|https]:\/\/.+?\/(.+)[?|\/|]$/

In summary:

http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price/ should return 2013/07/31/a-new-health-care-approach-dont-hide-the-price

http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price?id=2 should return 2013/07/31/a-new-health-care-approach-dont-hide-the-price

http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price should return 2013/07/31/a-new-health-care-approach-dont-hide-the-price

Henley
  • Yes, a regex can accomplish this, but there is more to parsing and decoding a URL than the selected answer's regex can handle, especially when dealing with an encoded query section, [IDNs](http://en.wikipedia.org/wiki/Internationalized_domain_name), or any path that doesn't begin with the year. In other words, this is a very specific fix that works only on a small domain. Use well-tested tools such as URI, or especially Addressable::URI, instead; they handle all URLs and the RFCs and give you the ancillary code needed to handle encoding/decoding too. – the Tin Man Aug 02 '13 at 16:19

2 Answers


Please don't use Regex for this. Use the URI library:

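require 'uri'
# URI#path keeps the leading slash and drops the query string.
str_you_want = URI("http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price").path
# => "/2013/07/31/a-new-health-care-approach-dont-hide-the-price"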
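
Since the question asks for the path without the leading or trailing slash, here is a minimal sketch of how that could be layered on top of URI. The helper name `path_without_slashes` is made up for illustration and is not part of the URI library:

```ruby
require 'uri'

# Hypothetical helper (not part of the URI library): parse the URL,
# take its path, and strip any leading and trailing slashes.
def path_without_slashes(url)
  URI(url).path.gsub(%r{\A/+|/+\z}, '')
end

# The three URLs from the question.
urls = [
  "http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price/",
  "http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price?id=2",
  "http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price"
]

# Each line printed should be the same slash-free path.
urls.each { |url| puts path_without_slashes(url) }
```

The query string never reaches the `gsub`, because `URI#path` already excludes it; only the slashes need trimming.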

Why?

See everything about this famous question for a good discussion of why these kinds of things are a bad idea.

Also, this XKCD really says why: Yep.

In short, regexes are an incredibly powerful tool, but when you're dealing with things built from hundred-page convoluted standards, and there is already a library that does the job faster, more easily, and more correctly, why reinvent the wheel?

Linuxios

If lookaheads are allowed

((2[0-9][0-9][0-9].*)(?=\?\w+)|(2[0-9][0-9][0-9].*)(?=/\s+)|(2[0-9][0-9][0-9].*).*\w)

Copy + Paste this in http://regexpal.com/

See here with ruby regex tester: http://rubular.com/r/uoLLvTwkaz

Image using javascript regex, but it works out the same


(?=) is just a lookahead

I basically set up three matches from 2XXX up to (in this order):

(?=\?\w+)  # lookahead for a question mark followed by one or more word characters
(?=/\s+)   # lookahead for a slash         followed by one or more whitespace characters
.*\w       # match up to the last word character

I'm pretty sure that some parentheses were not needed but I just copy pasted.

There are essentially two OR | expressions in the (A|B|C) expression. The order matters since it's like a (ifthen|elseif|else) type deal.
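
As a rough illustration of the ordering point, the sketch below runs the pattern from this answer against the question's URLs in Ruby. The method names `extract_path` and `else_branch_only` are made up for this example, and the second pattern is a simplified stand-in for the third alternative on its own:

```ruby
# Hypothetical wrapper around the answer's pattern, written with %r{}
# so that "/" needs no escaping in Ruby.
def extract_path(url)
  url[%r{((2[0-9][0-9][0-9].*)(?=\?\w+)|(2[0-9][0-9][0-9].*)(?=/\s+)|(2[0-9][0-9][0-9].*).*\w)}]
end

# Simplified form of the last alternative alone (the "else" branch),
# to show what happens when the lookahead branches are not tried first.
def else_branch_only(url)
  url[/2[0-9][0-9][0-9].*\w/]
end

base = "http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price"

# All three forms from the question print the same path.
puts extract_path(base + "/")
puts extract_path(base + "?id=2")
puts extract_path(base)

# Without the "?" lookahead tried first, the query string is swallowed too.
puts else_branch_only(base + "?id=2")
```

The last line prints the path with `?id=2` still attached, which is why the lookahead alternatives have to come before the catch-all one.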

You can probably adjust the prefix; I just assumed that you wanted to match 2XXX, where X is a digit.

Also, save the pitchforks, everyone: regular expressions are not always the best tool, but they're there for you when you need them.

Also, there is an xkcd for everything: https://xkcd.com/208/

Kevin Lee