I am trying to search a string and match several capture groups. Two of these capture groups are optional, so they may or may not match. I am using pcregrep with the -onumber option (here -o1 and -o2) to return the individual capture groups. The question is: how can I return a default value when those groups do not match? I tried using a disjunction (alternation), but without success.

Example:

../pcre-8.32/pcregrep  -Min -o1 -o2 --om-separator="; " '(?s)<!-- BOUNDARY -->(?!.*?Read the full review).*?((\d*) of (\d*) people found the following review helpful|.*?).*?Help other customers find the most helpful' shirts/B000W18VGW

produces the correct line numbers.

-Min -o1 -o2 --om-separator="; " '(?s)<!-- BOUNDARY -->(?!.*?Read the full review).*?(\d*) of (\d*) people found the following review helpful.*?Help other customers find the most helpful' shirts/B000W18VGW

produces the correct output but only for the lines with

(\d*) of (\d*) people found the following review helpful

If the line above does not exist, I would like to return "0" for each of the capture groups.

Is this possible and, if so, how?

ekad
user2051561

1 Answer

You can't make a character appear magically. That is, if there's no 0 anywhere in your subject string, then there's no way to capture a 0. Thus, if you want to capture a 0, you have to insert a 0 into the subject.
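To illustrate the point outside pcregrep, here is a minimal Python sketch (the helper name `helpful_counts` and the sample strings are hypothetical, and `\d+` is used instead of the question's `\d*` so that empty numbers are not accepted). When the sentence is absent there is nothing to capture from, so the "0" has to be supplied by the calling code rather than by the pattern:

```python
import re
from typing import Tuple

# Pattern for the sentence quoted in the question.
HELPFUL = re.compile(r"(\d+) of (\d+) people found the following review helpful")


def helpful_counts(review: str) -> Tuple[str, str]:
    """Return (helpful, total) as strings; fall back to ("0", "0") in code."""
    m = HELPFUL.search(review)
    if m is None:
        # The regex cannot produce a "0" that is not in the text,
        # so the default is chosen here, outside the pattern.
        return ("0", "0")
    return (m.group(1), m.group(2))


if __name__ == "__main__":
    print(helpful_counts("3 of 5 people found the following review helpful"))
    print(helpful_counts("A review without any votes."))
```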

Now, let's say for some crazy reason, you're able and willing to modify your subject string (though apparently you're not able or not willing to set the 0 case outside of the regular expression, i.e. in code). Then, here's one solution.

Append 0 of 0 people found the following review helpful at the very end of your subject string, and instead of this:

((\d*) of (\d*) people found the following review helpful|.*?)

do this:

(?=.*?(\d*) of (\d*) people found the following review helpful)

In other words, by appending the 0 of 0 people [...] you're guaranteeing that that sentence will exist somewhere, so by capturing the numbers within a zero-width lookahead assertion, you can seek the sentence anywhere in your subject string, before carrying on with the rest of your regex.
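The appended-sentence idea can be tried out in Python as a rough sketch. The names `SENTINEL` and `counts_with_sentinel` are made up for this illustration, `\d+` replaces `\d*` so that empty numbers are rejected, and only the lookahead part of the pattern is shown, not the full pcregrep expression from the question:

```python
import re

# Sentence appended to every subject so the lookahead always finds something.
SENTINEL = "0 of 0 people found the following review helpful"

# (?s) lets "." cross newlines, as in the question's pattern. The lazy ".*?"
# stops at the first occurrence, so a real count in the review is found
# before the appended sentinel.
COUNTS = re.compile(
    r"(?s)(?=.*?(\d+) of (\d+) people found the following review helpful)"
)


def counts_with_sentinel(review):
    """Capture the counts via a zero-width lookahead over review + sentinel."""
    m = COUNTS.match(review + "\n" + SENTINEL)
    # The sentinel guarantees a match, so m is never None here.
    return m.group(1), m.group(2)


if __name__ == "__main__":
    print(counts_with_sentinel("12 of 15 people found the following review helpful\nNice."))
    print(counts_with_sentinel("Great shirt."))
```

Because the lookahead consumes no characters, the rest of a longer pattern could continue matching from the same position after the groups are captured.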

Andrew Cheong
  • Appreciate the feedback. Unfortunately, your suggestion is not possible. I would like to point out that: 1. I am using pcregrep to analyse all occurrences within a document, so I cannot append anything to the "string". 2. The issue here is not one of matching but of showing the group captures. I figure it's the group capture within the disjunction that is incorrect. I still think this is possible, even if I cannot capture as I have shown; it may require some post-processing. – user2051561 Feb 08 '13 at 08:07
  • @user2051561 - I could be misunderstanding, so please correct me if I am. I have spent _days_ trying to make a `0` appear when I was trying to write [a regular expression that would increment numbers](http://stackoverflow.com/questions/12941362/is-it-possible-to-increment-numbers-using-regex-substitution). I went quite deep into various escape sequences and tricks, but found no way to capture a character that didn't appear anywhere in the document. Now, against all odds, if you do find a way, then you've also found a better answer for the linked question! – Andrew Cheong Feb 08 '13 at 11:38
  • Doubt it. But if I find a way I will post it here. – user2051561 Feb 08 '13 at 12:59