I have a number of scenarios that I am trying to account for, but I can't seem to nail down my match string (#regexbeginner). Unfortunately, no JavaScript is possible, as this regex is being used within Adobe Analytics' Classification Rule Builder.

What I am after are three groups:

  1. Base URL (not including http[s]:\/\/www.)
  2. The tracking code (everything after the ?, but before the #)
  3. The hash (everything after the #)

The thing is, the tracking code and the hash are both optional. Both might appear, only one of them might appear, or neither might appear. There can never be more than one tracking code or more than one hash in the URL, and the hash will never appear before the tracking code.

Here is where I have got to so far: ^http[s]:\/\/www.(.+\/.+)\?(.+)?#(.+)? This works fine if there is both a tracking code and a hash, but it does not work if only one of them, or neither, is present.

Below are my test cases. All of them need to return three groups, but I understand that group 2 and/or group 3 may be empty.

Any help would be appreciated. I feel like this should be easy for someone with a little experience.

Thanks, Chris

Chris
  • Possible duplicate of [URL parsing in Java](https://stackoverflow.com/questions/21545912/url-parsing-in-java) – Nir Alfasi Jun 14 '17 at 21:13
  • Don't re-invent the wheel, use [the right tool](https://docs.oracle.com/javase/tutorial/networking/urls/urlInfo.html) to do the job! – Nir Alfasi Jun 14 '17 at 21:14
  • Do you need to match **all** the cases stated or just where you have the tracking code or hash? – Kind Stranger Jun 14 '17 at 21:20
  • Hi all, thanks for weighing in. Just to clarify - I'm after 3 groups, though sometimes I know that these groups may be empty. For my last test case g1 = example.com/en-US/tires/wrangler-duratrac, g2 = sku=150638601, g3 = 121 – Chris Jun 14 '17 at 21:24
  • For my first test case g1 = example.com/en-US/tires/wrangler-duratrac and g2 and g3 would be empty – Chris Jun 14 '17 at 21:25
  • So all the test cases need to return a g1, g2 and g3 result, but as mentioned g2 and g3 might be empty. FYI - this is being used in Adobe Analytics to classify URLs on the fly. No JavaScript is possible. – Chris Jun 14 '17 at 21:27

1 Answer

This seems to do the trick and matches all of your test cases above:

^https:\/\/www\.([^?#\s]+)(\?[^\s#]*)?(#.*)?
  • Group 1 is anything after https://www. up to either # or ?
  • Group 2 is optional and matches ? and any character following up to #
  • Group 3 is optional and matches # and any character following
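
As a quick way to check the pattern outside Adobe Analytics, here is a minimal Python sketch. It assumes Python's `re` engine behaves the same way as the Classification Rule Builder for this pattern, which I have not verified. The helper name `split_url` is made up for illustration, and the URLs are variations on the example from the question.

```python
import re

# Pattern copied from the answer above, unchanged.
PATTERN = re.compile(r"^https:\/\/www\.([^?#\s]+)(\?[^\s#]*)?(#.*)?")


def split_url(url):
    """Return (base, tracking, hash) or None if the URL does not match.

    Optional groups that did not take part in the match come back as None
    from re, so they are converted to empty strings here.
    """
    match = PATTERN.match(url)
    if match is None:
        return None
    return tuple(part or "" for part in match.groups())


if __name__ == "__main__":
    base = "https://www.example.com/en-US/tires/wrangler-duratrac"
    for suffix in ("", "?sku=150638601", "#121", "?sku=150638601#121"):
        print(split_url(base + suffix))
```

Note that the pattern only accepts https:// URLs, so a plain http:// URL returns None in this sketch.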

Using the example https://www.example.com/en-US/tires/wrangler-duratrac?sku=150638601#121:

  • Group 1 = example.com/en-US/tires/wrangler-duratrac
  • Group 2 = ?sku=150638601
  • Group 3 = #121

For https://www.example.com/en-US/tires/wrangler-duratrac#121:

  • Group 1 = example.com/en-US/tires/wrangler-duratrac
  • Group 2 is empty
  • Group 3 = #121
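
In the comments you said you want g2 = sku=150638601 and g3 = 121, without the leading ? and #. One possible variant, sketched in Python below, wraps the delimiters in non-capturing groups and also makes the s in https optional. This is an assumption on my part: I don't know whether the Classification Rule Builder supports the (?:...) syntax, so treat it as something to try rather than a confirmed solution. The names `VARIANT` and `split_without_delimiters` are hypothetical.

```python
import re

# Hypothetical variant, not the pattern from the answer:
#   https?       accepts both http and https
#   (?:\?(...))? keeps the "?" out of group 2
#   (?:#(...))?  keeps the "#" out of group 3
VARIANT = re.compile(r"^https?://www\.([^?#\s]+)(?:\?([^#\s]*))?(?:#(.*))?")


def split_without_delimiters(url):
    """Return (base, tracking, hash) with no leading '?' or '#'."""
    match = VARIANT.match(url)
    if match is None:
        return None
    # Unmatched optional groups are None in Python; use '' instead.
    return tuple(part or "" for part in match.groups())


if __name__ == "__main__":
    url = "https://www.example.com/en-US/tires/wrangler-duratrac?sku=150638601#121"
    print(split_without_delimiters(url))
```

If non-capturing groups are not available, the original pattern still works and the leading ? and # would have to be dropped in a later step.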
Kind Stranger