2 SOLUTIONS POSTED AT BOTTOM

My code

    data test;  
        extract_string = "<some string here>";
        my_result1 = prxchange(cat("s/^.*", extract_string, ".*$/$1/"), -1, "A1M_PRE");  
        my_result2 = prxchange(cat("s/^.*", extract_string, ".*$/$1/"), -1, "AC2_0M");  
        my_result3 = prxchange(cat("s/^.*", extract_string, ".*$/$1/"), -1, "GA3_30M");
        my_result4 = prxchange(cat("s/^.*", extract_string, ".*$/$1/"), -1, "DE3_1H30M");  
    run;

Desired results

Extract the number after _ but preceding M in strings that have M at the end. The result set should be:

    my_result1 = ""  
    my_result2 = "0"  
    my_result3 = "30"  
    my_result4 = "30"

The following extract_string values fail

"\.*(\d*)M\b\"  
"\.*(\d*?)M\b\"  
"\.*(\d{*})M\b\"  
"\.*(\d{*?})M\b\"  
"\.*(\d){*}M\b\"  
"\.*(\d){*?}M\b\"  

"\.*(\d+)M\b\"  
"\.*(\d+?)M\b\"  
"\.*(\d{+})M\b\"  
"\.*(\d{+?})M\b\"  
"\.*(\d){+}M\b\"  
"\.*(\d){+?}M\b\"  

"\.*(\d+\d+)M\b\" 

Potential solutions which I would request help with

  • Perhaps I just haven't tested the correct extract_string yet. Ideas?
  • Perhaps my cat("s/^.*", extract_string, ".*$/$1/") needs to be modified. Ideas?
  • Perhaps I need to use prxposn(prxmatch(prxparse())) instead of prxchange. How would that be formulated?

Links I've looked at but have not been able to successfully implement

https://support.sas.com/rnd/base/datastep/perl_regexp/regexp-tip-sheet.pdf

https://www.pharmasug.org/proceedings/2013/CC/PharmaSUG-2013-CC35.pdf

SAS PRX to extract substring please

extracting substring using regex in sas

Extract substring from a string in SAS

SOLUTIONS

Solution 1

The suffix in the cat function and the extract_string were modified.

    data test;  
        extract_string = "?(?:_[^_\r\n]*?(\d+)M)?$";
        my_result1 = prxchange(cat("s/^.*", extract_string, "/$1/"), -1, "A1M_PRE");
        my_result2 = prxchange(cat("s/^.*", extract_string, "/$1/"), -1, "AC2_0M");
        my_result3 = prxchange(cat("s/^.*", extract_string, "/$1/"), -1, "GA3_30M");
        my_result4 = prxchange(cat("s/^.*", extract_string, "/$1/"), -1, "DE3_1H30M");
    run;

Solution 2

This solution uses the other prx-family functions: prxparse, prxmatch, and prxposn.

    data have;
      length string $10;
      input string;
      datalines;
    A1M_PRE
    AC2_0M
    GA3_30M
    DE3_1H30M
    ;

    data want;
      set have;

      rxid = prxparse ('/_.*?(\d+)M\s*$/');

      length digit_string $8;

      if prxmatch (rxid, string) then digit_string = prxposn(rxid,1,string);

      number_extracted = input (digit_string, ? 12.);
    run;

Jayden.Cameron
  • Does [**this**](https://regex101.com/r/rBBQVd/1) help? –  Jun 10 '20 at 06:18
  • You could use a capturing group `(\d+)M$` https://regex101.com/r/RcFOZ8/1 If you want to remove all and keep the group 1 value `^.*?(?:(\d+)M)?$` and replace with `$1` https://regex101.com/r/e7lBCX/1 – The fourth bird Jun 10 '20 at 07:19
  • I like the site you both referred to, but I still cannot get any `extract_string` you suggested to work to give expected results in my code (updated in main question). Perhaps the prefix and suffix in `cat` are wrong? – Jayden.Cameron Jun 10 '20 at 07:49

3 Answers

I understand that SAS can use Perl's regex engine. The latter supports \K, which directs the engine to discard everything matched so far and reset the starting point of the match to the current location. The following regular expression should therefore match the substring's digits that are of interest.

    _.*?\K\d+(?=M$)

A failure to match would be interpreted as an empty string having been matched.

Cary Swoveland
  • Would I need to use functions other than prxchange? Using your `extract_string` in my code just returns the original search strings. – Jayden.Cameron Jun 10 '20 at 07:58
  • Jayden, I can't say because I don't know SAS. Perhaps a reader who does can help out. – Cary Swoveland Jun 10 '20 at 08:03
  • I just had a thought. Did you try it only with `"A1M_PRE"`, which the regex does not match? If so, please try it with one of the other three examples you gave. – Cary Swoveland Jun 10 '20 at 08:06
  • BTW: I've reformatted the question for clarity. The `extract_string = "_.*?\K\d+(?=M$)"` fails to yield the desired results (all search strings are returned in their entirety). Perhaps it is due to my `cat` function? Or something else? I'm unsure. – Jayden.Cameron Jun 10 '20 at 08:11
  • Jayden, I'm frustrated that I can't help you with the SAS code. Hopefully a SASSY person (I assume that's what you call yourselves) will stop by this thread and help out. – Cary Swoveland Jun 10 '20 at 08:13
  • :D :P I'm new to SAS and have loved Stack Overflow for R and Python tips. I'm not sure if it's the best place for SAS help yet, but I'm still figuring out what my resources are. Thanks for your attempts to help! – Jayden.Cameron Jun 10 '20 at 08:20

If you want to replace the whole line and keep only the digits that precede the M at the end of the line, you could use a capturing group and use the value of group 1 ($1) in the replacement.

    ^.*?(?:_[^_\r\n]*?(\d+)M)?$

Explanation

  • ^ Start of string
  • .*? Match any character, as few times as possible
  • (?: Non capture group
    • _[^_\r\n]*? Match _ followed by as few characters as possible that are not an underscore, carriage return, or newline
    • (\d+)M Capture group 1, match 1+ digits followed by M
  • )? Close group and make it optional
  • $ End of string


You could make the extract_string the full pattern:

    extract_string = "^.*?(?:_[^_\r\n]*?(\d+)M)?$";
    my_result1 = prxchange(cat("s/", extract_string, "/$1/"), -1, "A1M_PRE");

Or, if you must keep the leading ^.* in the cat prefix, use:

    extract_string = "?(?:_[^_\r\n]*?(\d+)M)?$";
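
As a rough sketch of how the full pattern might be applied to a whole data set rather than to string literals: the data set names `have` and `want` and the variable `digits` are illustrative only, and the character class is written as `[^_\r\n]` on the assumption that a carriage return was meant. Because `$` has to match directly after the M, the blank padding of the character variable is trimmed first.

```sas
/* Illustrative input: the four sample values from the question. */
data have;
  length string $10;
  input string;
  datalines;
A1M_PRE
AC2_0M
GA3_30M
DE3_1H30M
;

data want;
  set have;
  length digits $8;
  /* Single quotes keep the macro processor away from the pattern text.  */
  /* TRIM drops the trailing blanks so that $ can match right after M.   */
  /* The whole value is replaced by capture group 1, which stays empty   */
  /* when the optional group does not take part in the match.            */
  digits = prxchange('s/^.*?(?:_[^_\r\n]*?(\d+)M)?$/$1/', -1, trim(string));
run;
```

Writing the complete `s/.../$1/` expression in one place also avoids gluing a second `.*$` onto a pattern that already ends in `$`, which appears to be what produced the "variable was found" error.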
The fourth bird
  • @CarySwoveland Yes you are right, let me look into that. – The fourth bird Jun 10 '20 at 07:55
  • I'm afraid this answer is not formatted for how SAS handles regex; it returns errors if I use your `extract_string` in my code. Perhaps it can be modified to work? Or else I need to use functions other than prxchange? In which case, I'm still at a bit of a loss. – Jayden.Cameron Jun 10 '20 at 08:01
  • @Jayden.Cameron I don't know sas either, I looked at the example links you added in the question and I would expect this to work. What are the errors that you get? – The fourth bird Jun 10 '20 at 08:12
  • `ERROR: A variable was found in regular expression "s/^.*^.*?(?:_[^_r\n]*?(\d+)M)?$.*$/$1/". Variables within regular expressions are not supported.` – Jayden.Cameron Jun 10 '20 at 08:19
  • But this pattern `^.*?(?:_[^_r\n]*?(\d+)M)?$` is different from that pattern that you tried `^.*^.*?(?:_[^_r\n]*?(\d+)M)?$.*$` I think it is due to the duplicate anchors. Can you try just the patterns that we have provided? – The fourth bird Jun 10 '20 at 08:20
  • If I am not mistaken, the code could be `my_result1 = prxchange(cat("s/^.*?(?:_[^_r\n]*?(\d+)M)?$/$1/"), -1, "A1M_PRE");` to test it. – The fourth bird Jun 10 '20 at 08:24
  • `extract_string = "^.*?(?:_[^_r\n]*?(\d+)M)?$";` yields `ERROR: A variable was found in regular expression "s/^.*^.*?(?:_[^_r\n]*?(\d+)M)?$.*$/$1/". Variables within regular expressions are not supported.` The final solution worked (dropping my prefix and suffix from the `cat` function). I'll update the original post to reflect this. Thanks a ton for your help! (I'll mark this as the solution if you update your post to reflect that I needed to modify the `cat` argument). – Jayden.Cameron Jun 10 '20 at 08:29
  • @Jayden.Cameron I have added an update. I think you might also use the full pattern. – The fourth bird Jun 10 '20 at 08:58

Use PRXPOSN to extract a match group.

Example:

Use pattern /_.*?(\d+)M\s*$/ to locate the last run of digits before a terminating M character.

Regex:

  • _ literal underscore
  • .*? non-greedy any characters
  • (\d+) capture one or more digits
  • M literal M
  • \s*$ any number of trailing spaces, needed because SAS character values are right-padded with spaces to the variable's length

    data have;
      length string $10;
      input string;
      datalines;
    A1M_PRE
    AC2_0M
    GA3_30M
    DE3_1H30M
    ;

    data want;
      set have;

      rxid = prxparse ('/_.*?(\d+)M\s*$/');

      length digit_string $8;

      if prxmatch (rxid, string) then digit_string = prxposn(rxid,1,string);

      number_extracted = input (digit_string, ? 12.);
    run;

Richard
  • Thanks for the SAS-based response, @Richard, :). Is the use of `prxchange` vs `prxmatch(prxparse(), string)` with `prxposn(prxparse(), 1, string)` down to preference, or is there more to it that I'm missing? – Jayden.Cameron Jun 10 '20 at 14:40
  • For me it would be more a preference (or even coin-toss) by situation. Do I want to focus on what I want versus what I don't want... If I were processing m/billions of strings in a production job I would first benchmark for the best performing approach and also try non-regex extractions (SCAN/VERIFY/ANYDIGIT/etc). – Richard Jun 10 '20 at 14:54
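
For comparison, the non-regex route mentioned in the last comment might look something like the sketch below. It is only one possible formulation, not something taken from the answer: it uses SCAN and NOTDIGIT (rather than VERIFY or ANYDIGIT), and the names `want_noregex`, `tail`, `n` and `p` are made up for illustration. It assumes the digits of interest always sit in the text after the last underscore.

```sas
/* Illustrative input: the four sample values from the question. */
data have;
  length string $10;
  input string;
  datalines;
A1M_PRE
AC2_0M
GA3_30M
DE3_1H30M
;

data want_noregex;
  set have;
  length tail $10 digit_string $8;

  digit_string = ' ';
  /* Only look at values that contain an underscore and end in M. */
  if index(string, '_') > 0 and substr(string, length(string), 1) = 'M' then do;
    tail = scan(string, -1, '_');   /* text after the last underscore, e.g. 1H30M */
    n    = length(tail) - 1;        /* position of the character just before M    */
    if n > 0 then do;
      /* A negative start makes NOTDIGIT search leftward from position n; */
      /* it returns 0 when every character up to n is a digit.            */
      p = notdigit(tail, -n);
      /* Keep the digits between the last non-digit and the final M.      */
      if p < n then digit_string = substr(tail, p + 1, n - p);
    end;
  end;
  drop tail n p;
run;
```

Whether this is actually faster than the PRX functions would have to be benchmarked on real data, as the comment suggests.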