Is a regex of the following form legit in Obj C?

"<(img|a|div).*?>.*?</$1>"

I know it's valid in JS with a \1 instead of $1, but I'm having little luck in Obj C.

puzzl
  • Have you read the [`NSRegularExpression` docs](https://developer.apple.com/library/mac/documentation/Foundation/Reference/NSRegularExpression_Class/index.html) and followed the link in the first paragraph to the ICU regular expression syntax? – CRD Mar 09 '15 at 20:55
  • Show the code you're trying; and no, that regex won't work in Objective-C. – l'L'l Mar 09 '15 at 20:55
  • I'm obligated to warn you [to not parse HTML with regular expressions](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). HTML is **not** a regular language. – Joe Mar 09 '15 at 20:58
  • The regex I'm writing is not for parsing HTML, I'm using HTML here as an example, because it's much clearer than the regex I'm using. Yes, I have read the docs, and it's not clear whether this is supported, as the Template Matching Format section comes directly after other syntax tables, and does not specify whether it is valid in the pattern or not. All I'm asking is whether you can use a previous capture group within a pattern. – puzzl Mar 09 '15 at 21:00
  • Well in that case use `\1` ( `@"... \\1>"` ) instead of `$1`. – Joe Mar 09 '15 at 21:05
  • From the first table of metacharacters in the docs: *`\n`: Back Reference. Match whatever the nth capturing group matched. n must be a number ≥ 1 and ≤ total number of capture groups in the pattern.* – CRD Mar 09 '15 at 22:04
  • @Joe You should post that as an answer. – Rob Mar 09 '15 at 22:57

2 Answers

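`NSRegularExpression` uses ICU regular expressions, which use `\n` syntax for back references, where n is the number of the capture group. In an Objective-C string literal the backslash itself must be escaped, so `\1` is written `\\1`: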

<(img|a|div).*?>.*?</\\1>
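
A minimal sketch of how that pattern might be used. `FirstPairedTag` is a hypothetical helper name introduced here for illustration, not something from the question, and the sample inputs in the note below are made up:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the first substring of `input` whose closing
// tag name equals its opening tag name, or nil if there is none.
static NSString *FirstPairedTag(NSString *input) {
    // "\\1" in the string literal reaches the regex engine as "\1": a back
    // reference to whatever capture group 1, (img|a|div), matched.
    NSString *pattern = @"<(img|a|div).*?>.*?</\\1>";
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        return nil; // the pattern failed to compile; details are in `error`
    }
    NSTextCheckingResult *match =
        [regex firstMatchInString:input
                          options:0
                            range:NSMakeRange(0, input.length)];
    // Index 0 of the result (match.range) is the whole match.
    return match ? [input substringWithRange:match.range] : nil;
}
```

With this helper, `FirstPairedTag(@"<div class=\"x\">hi</a></div>")` should return the whole input, because only `</div>` can satisfy `\1` once group 1 has captured `div`, while `FirstPairedTag(@"<a href=\"#\">x</div>")` should return nil.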
Joe

Yes, I believe you can work with capture groups. I had to use them a little while ago, and here is an example:

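// Returns the text captured by the first group of `regex` in `link`, or @"" if nothing matches.
- (NSString *)extractMediaLink:(NSString *)link withRegex:(NSString *)regex {
    // stringByRemovingPercentEncoding returns nil on invalid input; fall back to the raw link.
    NSString *utf8Link = [link stringByRemovingPercentEncoding] ?: link;
    NSError *regexError = nil;

    NSRegularExpression *regexParser = [NSRegularExpression regularExpressionWithPattern:regex
                                                                                 options:NSRegularExpressionCaseInsensitive | NSRegularExpressionUseUnixLineSeparators
                                                                                   error:&regexError];
    NSTextCheckingResult *regexResults = [regexParser firstMatchInString:utf8Link
                                                                 options:0
                                                                   range:NSMakeRange(0, utf8Link.length)];

    // No match (or an invalid pattern), or a pattern without a capture group: nothing to extract.
    if (regexResults == nil || regexResults.numberOfRanges < 2) {
        return @"";
    }

    // Index 0 is the whole match; index 1 is the first capture group, which holds the ID.
    NSRange idRange = [regexResults rangeAtIndex:1];
    // A group that did not participate in the match reports NSNotFound.
    if (idRange.location == NSNotFound) {
        return @"";
    }
    return [utf8Link substringWithRange:idRange];
}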

When you use an NSRegularExpression to produce an NSTextCheckingResult, the result has a numberOfRanges property, which is documented as follows:

A result must have at least one range, but may optionally have more (for example, to represent regular expression capture groups).

In my example above, I use -[NSRegularExpression firstMatchInString:options:range:] to find the first instance of the tag that matches my regex pattern. From that NSTextCheckingResult I pull out the range of the capture group I'm interested in (in this case, [regexResults rangeAtIndex:1]). (Note: I happen to be parsing HTML, but with an additional pod that traverses HTML by XPath queries, TFHpple, which is a lifesaver if you absolutely have to parse HTML.)
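
To make the range indexing concrete, here is a small sketch that collects every capture group of the first match. `CaptureGroups` is a hypothetical helper, not part of the method above; it relies on index 0 being the whole match and, as an assumption of this sketch, substitutes an empty string for any group that did not participate:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the text of capture groups 1..n for the first
// match of `pattern` in `input`, or an empty array if nothing matches.
static NSArray<NSString *> *CaptureGroups(NSString *pattern, NSString *input) {
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:0
                                                    error:NULL];
    NSTextCheckingResult *match =
        [regex firstMatchInString:input
                          options:0
                            range:NSMakeRange(0, input.length)];
    NSMutableArray<NSString *> *groups = [NSMutableArray array];
    if (match == nil) {
        return groups; // invalid pattern or no match
    }
    // Start at 1: range 0 covers the entire match, not a capture group.
    for (NSUInteger i = 1; i < match.numberOfRanges; i++) {
        NSRange range = [match rangeAtIndex:i];
        // Groups that did not participate report a location of NSNotFound.
        [groups addObject:(range.location == NSNotFound)
                              ? @""
                              : [input substringWithRange:range]];
    }
    return groups;
}
```

For instance, `CaptureGroups(@"(\\w+)@(\\w+)", @"mail joe@example now")` should yield `joe` and `example`.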

Getting to this point was a huge pain, though. To make sure you're getting the right expressions, I would highly recommend using Regex101 with the Python setting, and then passing the refined regex into Patterns (Mac App Store).

If you want the full look, I have a fairly detailed project here, but keep in mind it's still a WIP.

Louis Tur
  • That's not really answering the question at all. Joe above was right, though: use \1 (or, more realistically, \\1 in a string literal) instead of $1. He didn't post it as an answer, so I can't vote for it. – puzzl Mar 10 '15 at 01:42