Regex for matching everything before trailing slash, or first question mark?

Question

I'm trying to come up with a regex that will elegantly match everything in a URL after the domain name and before the first ?, the last slash, or the end of the URL if neither of the two exists.

This is what I came up with but it seems to be failing in some cases:

regex = /[http|https]:\/\/.+?\/(.+)[?|\/|]$/

In summary:

http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price/ should return 2013/07/31/a-new-health-care-approach-dont-hide-the-price

http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price?id=2 should return 2013/07/31/a-new-health-care-approach-dont-hide-the-price

http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price should return 2013/07/31/a-new-health-care-approach-dont-hide-the-price

Yes, a regex can accomplish this, but there is more to parsing and decoding a URL than the selected answer's regex can handle, especially when dealing with the query section if it's been encoded, or [IDNs](http://en.wikipedia.org/wiki/Internationalized_domain_name) or anything that doesn't have the year at the beginning of the path. In other words, this is only a very specific fix working only on a small domain. Use well tested tools, such as URI or especially Addressable::URI instead, which handle all URLs and the RFCs and give you the needed ancillary code to handle encoding/decoding too. — the Tin Man, Aug 02 '13 at 16:19

score 8 · Answer 1 · edited May 23 '17 at 12:33

Please don't use Regex for this. Use the URI library:

require 'uri'
# URI#path drops the query string but keeps the leading slash (and any trailing one).
str_you_want = URI("http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price").path
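
Since `path` still carries the leading slash, here is a minimal sketch of trimming it to the form the question asks for. The helper name `path_without_slashes` is made up for illustration, and the sketch assumes the input is a well-formed http(s) URL that `URI.parse` accepts.

```ruby
require 'uri'

# Hypothetical helper: the URL's path with leading and trailing slashes removed.
# URI#path already excludes the query string ("?id=2").
def path_without_slashes(url)
  URI.parse(url).path.gsub(%r{\A/+|/+\z}, '')
end

base = "http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price"

# The three cases from the question: trailing slash, query string, neither.
[base + "/", base + "?id=2", base].each do |url|
  puts path_without_slashes(url)
end
# Each line printed is 2013/07/31/a-new-health-care-approach-dont-hide-the-price
```
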

Why?

See everything about this famous question for a good discussion of why these kinds of things are a bad idea.

Also, this XKCD really says why: Yep.

In short, regexes are an incredibly powerful tool, but when you're dealing with things defined by convoluted hundred-page standards, and there is already a library that does the job faster, more easily, and more correctly, why reinvent the wheel?

edited May 23 '17 at 12:33 by Community

answered Aug 02 '13 at 02:31 by Linuxios

@squiguy: Thanks. (We all know that good answer is #42 on the list of voting priorities. I'm pretty sure that having XKCD is #5 ;) ) – Linuxios Aug 02 '13 at 02:55
URI.new is giving me NoMethodError: undefined method `new' for URI:Module – Henley Aug 02 '13 at 03:05
Also, although I posted Ruby code, I need this regex in the database query as well, so I can't reuse this URI method in the database. So a regex is preferable. But this suffices for my Ruby code. – Henley Aug 02 '13 at 03:07
@HenleyChiu: Whoops! Try it again now. – Linuxios Aug 02 '13 at 03:09
As I am fond of saying: Regular expressions are not a magic wand you wave at every problem that happens to involve strings. – Andy Lester Aug 02 '13 at 03:34
Regex is still the magic wand though. How do you think URI does it? – pguardiario Aug 02 '13 at 06:50
@pguardiario: See the source (it's in Ruby) [here](https://github.com/ruby/ruby/tree/trunk/lib/uri). – Linuxios Aug 02 '13 at 13:30
@Linuxios - I've seen it. If you don't see regex there you need glasses. – pguardiario Aug 02 '13 at 14:51
https://github.com/ruby/ruby/blob/trunk/lib/uri/common.rb is chock-full of regex. – the Tin Man Aug 02 '13 at 16:11

Kevin Lee · Accepted Answer · 2013-08-02T03:57:46.030

If lookaheads are allowed

((2[0-9][0-9][0-9].*)(?=\?\w+)|(2[0-9][0-9][0-9].*)(?=/\s+)|(2[0-9][0-9][0-9].*).*\w)

Copy + Paste this in http://regexpal.com/

See here with ruby regex tester: http://rubular.com/r/uoLLvTwkaz

Image using javascript regex, but it works out the same

(?=) is just a lookahead

I basically set up three matches from 2XXX up to (in this order):

(?=\?\w+)  # lookahead for a question mark followed by one or more word characters
(?=/\s+)   # lookahead for a slash         followed by one or more whitespace characters
.*\w       # match up to the last word character
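
A minimal Ruby sketch of how the three alternatives behave on the URLs from the question. The helper name `extract` is made up for illustration. The pattern is the one from this answer, written with `%r{}` so the slash needs no escaping.

```ruby
# Hypothetical helper: returns the text matched by this answer's pattern,
# or nil when nothing matches.
def extract(url)
  # Unchanged pattern from the answer; %r{...} avoids escaping the "/".
  pattern = %r{((2[0-9][0-9][0-9].*)(?=\?\w+)|(2[0-9][0-9][0-9].*)(?=/\s+)|(2[0-9][0-9][0-9].*).*\w)}
  url[pattern]
end

base = "http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price"

# Trailing slash, query string, and neither.
[base + "/", base + "?id=2", base].each do |url|
  puts extract(url)
end
# Each line printed is 2013/07/31/a-new-health-care-approach-dont-hide-the-price
```

Note that in the trailing-slash case nothing follows the slash, so the second lookahead does not apply and the third alternative (match up to the last word character) produces the result.
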

I'm pretty sure some of the parentheses aren't needed, but I just copy-pasted.

You can probably fix up the prefix; I just assumed that you wanted to match 2XXX, where X is a digit.
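
One way the prefix might be generalized so that it no longer depends on the year is sketched below. This is not the pattern from this answer. The helper name `path_between` is hypothetical, and the sketch assumes an http or https URL whose host is followed by a slash. It does no decoding and knows nothing about IDNs.

```ruby
# Hypothetical helper: captures whatever sits between the host and the first
# "?" or "#", dropping one trailing slash if present.
def path_between(url)
  pattern = %r{
    \Ahttps?://   # scheme
    [^/?#]+/      # host, then the slash that starts the path
    ([^?#]*?)     # capture: the path, matched lazily
    /?            # optional trailing slash, left out of the capture
    (?:[?#].*)?   # optional query string or fragment
    \z
  }x
  url[pattern, 1]
end

base = "http://nytimes.com/2013/07/31/a-new-health-care-approach-dont-hide-the-price"

[base + "/", base + "?id=2", base].each do |url|
  puts path_between(url)
end
# Each line printed is 2013/07/31/a-new-health-care-approach-dont-hide-the-price
```
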

Also, save the pitchforks, everyone: regular expressions are not always the best tool, but they're there for you when you need them.

Also, there is xkcd (https://xkcd.com/208/) for everything:
