1

I'm trying to select the id at the end of a URL, between the final / and the first ?. Example: http://www.website.com/page/support/28685875?JK.kj_id=

Would extract only the id: 28685875

I'm pretty awful at regex and have figured out these

  • ([^/]+$) selects the end 28685875?JK.kj_id=

  • .+?(?=\?) selects the start www.website.com/page/support/28685875

I thought to try and combine these together in various ways but after a few hours I've had no success.

Can anyone shed some light as to what I'm doing wrong / how to select this URL portion?

Edit: I am using a Java-based ETL application to transform datasets.

danielpiestrak
  • From the regex tag info: "Since regular expressions are not fully standardized, all questions with this tag should also **include a tag specifying the applicable programming language or tool**." – ClickRick Jan 07 '16 at 09:17
  • When speaking about parsing URLs in Java, I'd recommend using [URL class](https://docs.oracle.com/javase/7/docs/api/java/net/URL.html). Unless there are more strict requirements, this seems the cleanest solution. – Wiktor Stribiżew Jan 07 '16 at 10:41

2 Answers

2

NON-REGEX SOLUTION

In Java, you can use the URL class to parse URLs. So, the cleanest solution would be:

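URL aURL = new URL("http://www.website.com/page/support/28685875?JK.kj_id=");
String path = aURL.getPath(); // "/page/support/28685875" (the query string is not part of the path)
String str = path.substring(path.lastIndexOf("/") + 1); // everything after the last "/"
System.out.println(str); // 28685875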

See demo

See Parsing a URL tutorial.
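
As a minimal sketch, the same idea can be wrapped in a small helper. The class name `LastPathSegment`, the method name `lastSegment`, and the choice to return an empty string for input that cannot be parsed are all assumptions made for illustration, not part of the original snippet:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class LastPathSegment {

    // Hypothetical helper: returns the text after the last "/" of the URL path.
    // Assumption: input that cannot be parsed as a URL yields an empty string.
    static String lastSegment(String url) {
        try {
            String path = new URL(url).getPath();   // the query string is not part of the path
            return path.substring(path.lastIndexOf('/') + 1);
        } catch (MalformedURLException e) {
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(lastSegment("http://www.website.com/page/support/28685875?JK.kj_id="));
        System.out.println(lastSegment("http://www.website.com/page/support/"));
    }
}
```

The first call prints the id; the second prints an empty line because the path ends with a /. Note that the `URL(String)` constructor is deprecated on recent JDKs, where `URI.create(url).getPath()` is one possible substitute.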

REGEX SOLUTION

The regex that you are looking for should match the last / followed by digits or any other symbols up to a ?, which may in turn be followed by optional characters other than / up to the end of the string. The part between / and ? can be captured into a group and then used.

\/([^\/]*)\?[^\/]*$

See regex demo

The negated character class [^\/] matches any character but a /. Group 1 will hold the value you need.
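
A minimal sketch of reading Group 1 in Java, assuming a hypothetical helper named `extractId` that returns null when nothing matches. In a Java regex the / does not need escaping, so the pattern is written without the backslashes:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureGroupDemo {

    // Same pattern as above, written as a Java string literal.
    private static final Pattern ID_BEFORE_QUERY = Pattern.compile("/([^/]*)\\?[^/]*$");

    // Hypothetical helper: returns Group 1, or null when the pattern does not match.
    static String extractId(String url) {
        Matcher m = ID_BEFORE_QUERY.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractId("http://www.website.com/page/support/28685875?JK.kj_id="));
        System.out.println(extractId("http://www.website.com/page/support/28685875"));
    }
}
```

Because this pattern requires a literal ?, the second call finds no match and the helper returns null.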

To match only the substring you need, use lookarounds:

(?<=/)[^/]*(?=[?][^/]*$)
^^^^^      ^^^

or a simpler one:

(?<=/)[^/?]+(?=[?]|$)

See demo

Java code:

String s = "http://w...content-available-to-author-only...e.com/page/support/28685875?JK.kj_id=";
Pattern pattern = Pattern.compile("(?<=/)[^/?]+(?=[?]|$)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
    System.out.println(matcher.group()); 
} 

Alternatively, you can use a capturing-group-based regex and access Group 1 with matcher.group(1).

The (?<=/)([^/?]+)(?=[?]|$) pattern does the following:

  • (?<=/) - checks if there is a / before the currently tested position in the string (if failed, the index is advanced, next position is tested)
  • [^/?]+ - matches 1 or more characters other than / and ? (no escaping necessary here)
  • (?=[?]|$) - checks if the next character is ? or end of string. If not, fail the match.
Wiktor Stribiżew
  • Depending on what platform/tool/language you are using, the right solution may be other than regex. Only use this if you are bound to use regex. – Wiktor Stribiżew Jan 07 '16 at 09:13
  • I added a lookaround based solution – Wiktor Stribiżew Jan 07 '16 at 09:24
  • Thanks for the detailed answer. In my last few hours of learning regex I bashed my head against these ?<=/ lookbehinds and couldn't get them to work. This looks like the code I was trying to write, but yours worked. I'll have to compare it and test it more when I wake up. Sorry for the lack of details in the original post; I edited it to say that it's a Java-based ETL tool. – danielpiestrak Jan 07 '16 at 09:27
  • You are correct about looking to other tools for URL parsing. But most of the time I prefer regex above string parsing: You can separate implementation and context (a pattern can even be given in a config file), a regex can be tested without the need of the implementation and exceptions on a regex are more predictable than exceptions on string parsing (what happens if there is no "/" in the input? what happens if "/" is the last character?) – AutomatedChaos Jan 07 '16 at 10:32
  • @AutomatedChaos: Lots of SO users believe a regex should not be used if the job can be done without it. Here, the requirement is *get the substring after the last `/` up to the first `?` after it if any*. So, if the `/` is at the end, the result will be [an empty string](http://ideone.com/xiHZlZ) as expected. If there is no `/`, an [empty string is also returned since the path is empty](http://ideone.com/n87ziY) (`lastIndexOf` returns -1, we add 1 and get index 0 as the starting point for `substring`, so the whole, empty, path is returned as is). This is safe to use. – Wiktor Stribiżew Jan 07 '16 at 10:35
  • I would not recommend non-regex option if I was not sure it is efficient for the current scenario. Unless there are stricter requirements (like protocol can go missing), I'd really recommend using that option. – Wiktor Stribiżew Jan 07 '16 at 10:40
2

Try this:

\/([^\/\?]+)(?:\?|$)

This will fetch the characters after the last "/" and before the "?", if a "?" exists. Here the first group will give you the ID.

simplified

(?<=\/)([^\/\?]+)(?=\?|$)

This will fetch the ID as the whole match, so you do not need to read a group.
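
Both patterns can be sketched in Java as follows. The class and method names (`OptionalQueryDemo`, `viaGroup`, `viaLookaround`) are hypothetical, and returning null when nothing matches is an assumption; the escapes on / and on the ? inside the character class are dropped because Java does not need them:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OptionalQueryDemo {

    // First pattern: the ID is read from Group 1.
    private static final Pattern WITH_GROUP = Pattern.compile("/([^/?]+)(?:\\?|$)");

    // Second pattern: the lookarounds keep "/" and "?" out of the match itself.
    private static final Pattern WITH_LOOKAROUND = Pattern.compile("(?<=/)([^/?]+)(?=\\?|$)");

    // Hypothetical helper: returns Group 1 of the first match, or null.
    static String viaGroup(String url) {
        Matcher m = WITH_GROUP.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    // Hypothetical helper: returns the whole first match, or null.
    static String viaLookaround(String url) {
        Matcher m = WITH_LOOKAROUND.matcher(url);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String withQuery = "http://www.website.com/page/support/28685875?JK.kj_id=";
        String withoutQuery = "http://www.website.com/page/support/28685875";
        System.out.println(viaGroup(withQuery));
        System.out.println(viaGroup(withoutQuery));
        System.out.println(viaLookaround(withQuery));
        System.out.println(viaLookaround(withoutQuery));
    }
}
```

All four calls print the id. Both patterns work on the raw string rather than on a parsed URL, so a URL with no path at all, such as http://www.website.com, would yield the host name instead.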

Vegeta
  • I think it should work, too. Note that `?` does not have to be escaped in a character class. – Wiktor Stribiżew Jan 07 '16 at 09:19
  • I don't think a URL can have more than one "?". Please correct me if I am wrong. – Vegeta Jan 07 '16 at 09:21
  • Well, as I say, it should work too in this case. As for `?`, there can be more than one in a URL, see [Is it valid to have more than one question mark in a URL?](http://stackoverflow.com/questions/2924160/is-it-valid-to-have-more-than-one-question-mark-in-a-url) – Wiktor Stribiżew Jan 07 '16 at 09:22
  • This seems to work amazingly. Thank you, I can go to sleep peacefully now and wake up for work in 4 hours. One follow-up question: how could I modify it to exclude the / and ? from the final result? – danielpiestrak Jan 07 '16 at 09:23
  • I can't understand the follow-up question. Could you please explain? – Vegeta Jan 07 '16 at 09:26
  • Disregard the follow-up question, I misunderstood the regex. Your solution seems to work perfectly; I will need to test it in the morning though. Thank you for your time. – danielpiestrak Jan 07 '16 at 09:31