
I am struggling to get a PRXCHANGE call to behave the way I would like. I want to remove the text between a pair of tags (and the tags themselves). It works fine when the pattern occurs only once, but it does not return what I want when the pattern occurs multiple times, and I believe it has to do with my use of '.*'.

Here are some example strings and my current regex:

data test;
    in = 'keep text 1 <TAG> drop text 1 </TAG> keep text 2 <TAG> drop text 2 </TAG> keep text 3';
    output;
    in = 'This one works! <TAG> drop text 1 </TAG>';
    output;
    in = '<TAG> drop text 1 </TAG> This one works as well';
    output;
    in = 'This one works fine too! <TAG> drop text 1 </TAG> This works just dandy';
    output;
run;

data test;
    set test;
    out = prxchange("s/<TAG>.*<\/TAG>//i", -1, in);
run;

This results in the following strings:

keep text 1  keep text 3
This one works!
This one works as well
This one works fine too!  This works just dandy

The first result, "keep text 1  keep text 3", is the problematic one. What I am trying to get back is:

keep text 1 keep text 2 keep text 3

I believe the issue is that '.*' is greedy: it consumes the rest of the string and then backtracks only as far as the last </TAG>. This string contains two instances of the pattern, but the regex treats everything from the first <TAG> to the last </TAG> as a single match. Unfortunately, the text in between could be anything, from a single word to a paragraph, so I cannot assume anything about the middle, only about the start and the end.
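If greediness really is the cause, one thing that might be worth trying is the lazy form of the quantifier, '.*?', which Perl-style regular expressions (the syntax the PRX functions follow) use to match as little as possible. The sketch below is only an illustration under that assumption; the output data set name test_lazy is a made-up name chosen so the original test data set is left untouched.

```sas
data test_lazy;   /* hypothetical name, used only for this sketch */
    set test;
    /* .*? is the lazy counterpart of .* : each match should end at   */
    /* the nearest closing tag instead of the last one in the string. */
    out = prxchange("s/<TAG>.*?<\/TAG>//i", -1, in);
run;
```

With this pattern each <TAG> ... </TAG> pair should be removed separately, so "keep text 2" ought to survive in the first string. The blanks on either side of each removed pair are still left behind, so doubled spaces would remain; wrapping the result in compbl() is one possible way to collapse them.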
