I am not very familiar with regular expressions and ran into a problem which is beyond me. I would like help with coming up with an expression which tokenizes a string and then gets me everything BUT arbitrary tokens counting from the end.

For example, I would like to get everything BUT P037-077 from the following string

http://www.wayfair.com/George-Kovacs-by-Minka-Bling-Bling-1-Light-Wall-Sconce-P037-077-GKV1032.html

One approach to do this is to start counting tokens backwards with the delimiter being "-" (there is no guarantee of how many tokens there are to the left of the required part of the string) and get the 2nd and 3rd token and then get everything BUT that.

I got 90% of the way there with the expression -([^-]*-[^-]*)-[^-]*$, which returns P037-077, but I need to get the complement of that.
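
For reference, here is a minimal Java sketch of what that expression captures. The class and method names are made up for illustration; the only thing taken from the question is the pattern itself, and the example assumes the result is read from capture group 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaptureSketch {
    // Pattern from the question: group 1 holds the 3rd- and 2nd-last dash-separated tokens.
    static final Pattern TOKENS = Pattern.compile("-([^-]*-[^-]*)-[^-]*$");

    // Returns group 1 of the first match, or null when the input has fewer than three dashes.
    static String secondAndThirdLast(String input) {
        Matcher m = TOKENS.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String url = "http://www.wayfair.com/George-Kovacs-by-Minka-Bling-Bling-1-Light-Wall-Sconce-P037-077-GKV1032.html";
        System.out.println(secondAndThirdLast(url)); // P037-077
    }
}
```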

I don't know if I've explained very well. I will be happy to explain again if anything is unclear.

I know this can be done easily by tokenizing in any language but unfortunately I do not have the freedom to do that as the tool I am using takes only regex as an input. It uses the Java syntax.

Bohemian
Rabee
  • which tool are you using! – Anirudha Oct 25 '13 at 03:13
  • Are you seeking the 3rd and 2nd last tokens split on a dash? – Bohemian Oct 25 '13 at 03:14
  • Can you give a few examples of input and expected output so we can understand what you want? Right now it's unclear – Bohemian Oct 25 '13 at 04:01
  • The tool is called Diffbot. Right, the input string should be "George-Kovacs-by-Minka-Bling-Bling-1-Light-Wall-Sconce-P037-077-GKV1032.html" and the match should be "George-Kovacs-by-Minka-Bling-Bling-1-Light-Wall-Sconce--GKV1032.html". I've removed some of the unnecessary bits from the string in the original question for the sake of clarity. Again, we need to count tokens from the end because we need the 2nd and 3rd last tokens removed. – Rabee Oct 25 '13 at 04:47

3 Answers

This will remove the 2nd and 3rd last tokens when using a dash as the separator:

String cleaned = str.replaceAll("(-[^-]+){2}(?=-[^-]*$)", "");

Here's some test code:

String str = "http://www.wayfair.com/George-Kovacs-by-Minka-Bling-Bling-1-Light-Wall-Sconce-P037-077-GKV1032.html";
String cleaned = str.replaceAll("(-[^-]+){2}(?=-[^-]*$)", "");
System.out.println(cleaned);

Output:

http://www.wayfair.com/George-Kovacs-by-Minka-Bling-Bling-1-Light-Wall-Sconce-GKV1032.html

Bohemian
  • This does not work, unfortunately, as it matches "P037-077" and then replaces everything but that. It would work if I could use Java's replaceAll method (or any other language's for that matter), but I can use only regex due to the limitations of the tool I am using it on. – Rabee Oct 27 '13 at 15:20
  • What do you mean it didn't work? You tagged the question as "java" and I've tested this code and it *does* work. Why is the question tagged with java? Also, this approach should work for practically every tool out there. What exactly is the tool and what are its regex capabilities? – Bohemian Oct 27 '13 at 16:08
  • You're right. I shouldn't have tagged it as Java, no idea what I was thinking. Sorry about that. I've deleted that tag now. The tool is called Diffbot, which is a scraping tool. When you want to modify a field returned, it only takes regex as an input. You can't do anything besides that. – Rabee Oct 27 '13 at 20:13
  • The general approach should work. The java solution is pure regex, so it should translate to the diffbot API. – Bohemian Oct 27 '13 at 23:44
  • Unfortunately, it is not pure regex. You use regex to choose the second and third last tokens and use Java's replaceAll method to choose all but the second and third last tokens. I can not use the replaceAll method and need this to be expressed in pure regex. Do you see the problem? – Rabee Oct 28 '13 at 08:42
  • I don't see the problem, because I am *not* doing what you think I'm doing. I am choosing the 2nd and 3rd last tokens and replacing them with blank - effectively deleting them. I am not choosing "everything else", rather "everything else" is what's left after the replacement has been made. – Bohemian Oct 28 '13 at 10:43
  • "I am choosing the 2nd and 3rd last tokens and replacing them with blank - effectively deleting them" I can't do that because I do not have a language to do that in. I need that expressed in pure regex. – Rabee Oct 29 '13 at 14:51
  • What *can* you do then? It is not possible to write an expression that matches some input but "skips/omits" some of the match. Can you refer to groups in the input to be returned from the match? For example, `$1` or `\1` (depending on the tool) for group 1? Look for a method in the API that takes a match regex and a "selection" expression (or similar). – Bohemian Oct 29 '13 at 22:41
  • There is no such expression. Thanks anyway! – Rabee Oct 30 '13 at 13:33
  • What *can* you do? What does the method/function/feature that you are currently trying to use do exactly? – Bohemian Oct 30 '13 at 13:39
  • All I have is one regex field and one "replace" field. Anything matched by the regex in the regex field is replaced by what you input in the "replace" field. So I was hoping to capture everything but those two tokens and then replace them with nothing and I would be left with only those two tokens. – Rabee Nov 01 '13 at 19:41
  • That's what the two parameters of `replaceAll()` are exactly for! Try putting `(-[^-]+){2}(?=-[^-]*$)` in the "regex" field and leave the "replace" field blank. – Bohemian Nov 01 '13 at 19:57
  • I am left with `http://www.wayfair.com/Lite-Source-Checks-Linear-1-Light-Wall-Sconce-IT2126.html`. What I want to be left with is `P037-077`. – Rabee Nov 02 '13 at 15:36
  • At least twice in your question you ask for what I have done: *I would like to get everything BUT P037-077*, and confirmed: *This returns P037-077 but I need to get the complement of that* ("complement" means "everything else" in mathematics). Now, are you *sure* you want to *extract* "P037-077" and delete the rest of the URL? – Bohemian Nov 02 '13 at 21:00
  • Yes, you understood correctly. And you got me exactly what I wanted. BUT you used the replaceAll function which I cannot use. WITHOUT using the replaceAll function, can I get everything BUT P037-077? – Rabee Nov 05 '13 at 16:29
  • Just forget about replaceAll()! It is Java's way of processing situations like this; "function" takes the same two parameters. Did you try what I suggested 4 comments ago? To use the parameters? Also, can you confirm your input is *just* the URL? – Bohemian Nov 05 '13 at 20:47
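
Going by the comments above (one "regex" field, one "replace" field, and the goal of being left with only P037-077), one possible approach is a pattern that matches everything except those two tokens, with the replace field left blank. The sketch below tries this in Java with replaceAll. It assumes the tool replaces every match the way replaceAll does and supports lookahead, neither of which is confirmed in this thread; the class and method names are made up for illustration.

```java
class KeepOnlyTokens {
    // Matches everything EXCEPT the 3rd- and 2nd-last dash-separated tokens:
    //   alternative 1: from the start up to and including the 3rd-last dash
    //   alternative 2: the final dash plus the last token
    static final String EVERYTHING_ELSE = "^.*-(?=[^-]*-[^-]*-[^-]*$)|-[^-]*$";

    // Deletes both matches, leaving only the two wanted tokens.
    // Assumes a single-line input containing at least three dashes.
    static String keepTokens(String input) {
        return input.replaceAll(EVERYTHING_ELSE, "");
    }

    public static void main(String[] args) {
        String url = "http://www.wayfair.com/George-Kovacs-by-Minka-Bling-Bling-1-Light-Wall-Sconce-P037-077-GKV1032.html";
        System.out.println(keepTokens(url)); // P037-077
    }
}
```

The lookahead only succeeds at the dash that has exactly two more dashes after it, so the first alternative stops right before the 3rd-last token. Inputs with fewer than three dashes are not handled meaningfully by this sketch.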

Use Groups

^(.*)-[^-]*-[^-]*(-[^-]*)$

$1$2 gets what you want


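import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = "http://www.wayfair.com/George-Kovacs-by-Minka-Bling-Bling-1-Light-Wall-Sconce-P037-077-GKV1032.html";
// Group 1 is everything before the 3rd-last token; group 2 is the last token with its leading dash.
Matcher m = Pattern.compile("^(.*)-[^-]*-[^-]*(-[^-]*)$").matcher(input);
if (m.find()) {
    String output = m.group(1) + m.group(2);
    System.out.println(output);
}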
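
If the tool only offers a regex field and a replace field, the same two groups can probably be referenced from the replacement instead of through a Matcher. Here is a minimal sketch using Java's replaceAll, where $1$2 is the replacement string; whether the tool's replace field understands $1-style group references is an assumption, and the class and method names are made up for illustration.

```java
class GroupReplaceSketch {
    // Same pattern as above; the replacement keeps group 1 and group 2 and drops
    // the 3rd- and 2nd-last tokens that sit between them.
    static String dropTokens(String input) {
        return input.replaceAll("^(.*)-[^-]*-[^-]*(-[^-]*)$", "$1$2");
    }

    public static void main(String[] args) {
        String url = "http://www.wayfair.com/George-Kovacs-by-Minka-Bling-Bling-1-Light-Wall-Sconce-P037-077-GKV1032.html";
        // http://www.wayfair.com/George-Kovacs-by-Minka-Bling-Bling-1-Light-Wall-Sconce-GKV1032.html
        System.out.println(dropTokens(url));
    }
}
```

An input with fewer than three dashes does not match the pattern and comes back unchanged.
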
Anirudha

What you are looking for are "non-capturing groups". A group is anything enclosed in (). Every group is used in matching and is also captured in the result. A non-capturing group is anything enclosed in (?:) (the first three characters serve as the opening brace). A non-capturing group is used in matching but is not captured as a group in the result. Example:

^(match_me)(?:but_not_me)$

If you apply it to, say:

1: match_me
2: match_mebut_not_me

it will not match the first string, since the second part is not present. It will match the second string, but but_not_me is not captured as a group; note that it is still part of the overall match. See "What is a non-capturing group? What does a question mark followed by a colon (?:) mean?" for an example which involves URLs.
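
A small Java sketch of how this pattern behaves on the two strings. The class and method names are made up for illustration. Note that the text matched by the non-capturing group is still part of the overall match; it is only left out of the numbered groups.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonCapturingSketch {
    static final Pattern P = Pattern.compile("^(match_me)(?:but_not_me)$");

    // Reports the whole match, group 1, and the number of capturing groups.
    static String describe(String input) {
        Matcher m = P.matcher(input);
        if (!m.matches()) {
            return "no match";
        }
        return "whole=" + m.group() + ", group1=" + m.group(1) + ", groups=" + m.groupCount();
    }

    public static void main(String[] args) {
        System.out.println(describe("match_me"));           // no match
        System.out.println(describe("match_mebut_not_me")); // whole=match_mebut_not_me, group1=match_me, groups=1
    }
}
```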

Viktor Seifert