2

I have a SAS log file and I want to list only the lines that fall between two keywords: data and run.

The file can contain many such blocks, spread over many lines, for example:

MPRINT: data xxxxx;
yyyyy
xxxxxx
MPRINT: run;

fffff
yyyyy

data fff;
fffff
run;

I would like to have lines 1-4 and 8-10.

I tried something like egrep -iz file -e '\sdata\s+\S*\s+(.|\s)*\srun\s', but this expression matches everything from the first data to the last run (the (.|\s) is there to match newline characters).

I may also want to require additional words between data and run, like:

MPRINT: data xxx;
fffff
NOTE: ffdd
set fff;
xxxxxx
MPRINT: run;

data fff;
yyyyyy
run;

In some cases I would like to list only those data ... run blocks that contain the word set on some line.

I know there are many similar threads, but I didn't find one where the keywords can repeat multiple times. I'm not familiar with awk or sed, but I can use them if that helps.

[Edit]
Note that data and run are not necessarily at the beginning of the line (I updated the example). Also, there can't be another data between a data and its run.

[Edit2]
As Tom noted, every line I was looking for started with MPRINT(...):, so I filtered for those lines first.
anubhava's answer helped me the most with my final solution, so I marked it as the accepted answer.
The final expression looked like this:

grep -o path -e 'MPRINT.*' | cut -f '2-' -d ' ' |
grep -iozP '(?ms) data [^\(;\s]+.*?(set|infile).*?run[^\n]*\n'
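
For readers without GNU grep's -P option, the same idea can be sketched in awk. This is not the expression above but a rough equivalent under stated assumptions: the function name mprint_blocks and the sample log are hypothetical, every relevant line is assumed to carry an MPRINT(...): prefix, and each statement is assumed to sit on its own line.

```shell
# Hypothetical helper: print MPRINT data steps that contain a set or infile statement.
mprint_blocks() {
  awk '
    !/MPRINT/ { next }                         # skip lines not produced by MPRINT
    { sub(/^.*MPRINT[^:]*:[ \t]*/, "") }       # drop the "MPRINT(...):" prefix
    tolower($0) ~ /^data / { inblock = 1; buf = ""; keep = 0 }
    inblock {
      buf = buf $0 ORS                         # collect the current block
      if (tolower($0) ~ /^(set|infile) /) keep = 1
    }
    inblock && tolower($0) ~ /^run *;/ {
      if (keep) printf "%s", buf               # emit only blocks that had set/infile
      inblock = 0
    }
  ' "$1"
}

# Made-up sample log for illustration only.
log=$(mktemp)
cat > "$log" <<'EOF'
MPRINT(LOAD):   data xxx;
NOTE: ffdd
MPRINT(LOAD):   set fff;
MPRINT(LOAD):   run;
MPRINT(LOAD):   data yyy;
MPRINT(LOAD):   x = 1;
MPRINT(LOAD):   run;
EOF

out=$(mprint_blocks "$log")
printf '%s\n' "$out"
rm -f "$log"
```

Because the decision to print is made only when run; is reached, a block without set or infile is dropped as a whole, and a data step that never reaches a run; is never printed.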
Mr Patience
  • You can use [`data[\s\S]+?run`](https://regex101.com/r/VHA6bx/1/) to list the lines between `data` and `run`. I am not sure about `where there is set word in some line.`: do you want `set` to be required in the match? – Code Maniac Jul 16 '19 at 10:21
  • If you are searching only for lines generated by the MPRINT option then it is simpler, since both the DATA statement and the RUN statement should be on separate lines. Note that the SAS log is not going to have `MLOGIC:` instead of `MPRINT:` in front of the line with the `run;` statement. But unless you have rigid control over your SAS coding standards, you cannot be certain that every data step will have a RUN statement. – Tom Jul 16 '19 at 13:08

3 Answers

2

You may use this GNU grep command with the -P (PCRE) option:

grep -ozP '(?ms).*?data .*?run[^\n]*\n' file

If you only want to print blocks that contain a line starting with set, then use:

grep -ozP '(?ms).*?data .*?^set.*?run[^\n]*\n' file

MPRINT: data xxxxx;
yyyyy
set fff;
xxxxxx
MLOGIC: run;

You may use this awk command to print the blocks between the two keywords that contain a line starting with set:

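awk '/data / {
   p = 1       # inside a data ... run block
   buf = ""    # drop lines buffered from an earlier block that had no set line
}
p && !y {
   if (/^set/)
      y = 1    # block qualifies; start printing
   else
      buf = buf $0 ORS
}
y {
   if (buf != "")
      printf "%s", buf    # flush the lines seen before the set line
   buf = ""
   print
}
/run/ {
   p = y = 0   # end of block
}' file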

MPRINT: data xxxxx;
yyyyy
set fff;
xxxxxx
MLOGIC: run;

If you just want to print everything between the two keywords in awk, it is as simple as:

awk '/data /,/run/' file
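
As a quick check, the range pattern can be run against the first sample from the question. This is only a sketch; the temporary file and the variable names are assumptions made for illustration.

```shell
# Sample input copied from the first example in the question.
log=$(mktemp)
cat > "$log" <<'EOF'
MPRINT: data xxxxx;
yyyyy
xxxxxx
MPRINT: run;

fffff
yyyyy

data fff;
fffff
run;
EOF

# Print each line matching /data / and everything up to the next line matching /run/.
blocks=$(awk '/data /,/run/' "$log")
printf '%s\n' "$blocks"
rm -f "$log"
```

Both blocks are printed and the lines between them are skipped. Note that /run/ also matches any line that merely contains those letters (for example "truncate"), so a stricter end pattern such as /run;/ may be safer on a real log.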
anubhava
1

As far as I understand, the following will do the trick:

sed -n '/data.*;/,/run;/p' "$FILENAME"

Note that the '.*' after data can be tightened with something like [a-zA-Z]{5} (with sed -E, or \{5\} in basic syntax), so that you guard against matching the word data somewhere in the middle of a line.

Selecting only the blocks that contain set would need some extra decision logic beyond a plain range. Matching from data up to set, on the other hand, is just:

sed -n '/data.*;/,/set.*;/p' "$FILENAME"

(Probably learned along the way from How to use sed/grep to extract text between two words?)
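
To illustrate the tightening mentioned above, here is a small sketch comparing the loose start pattern with a stricter one. The sample text, the noise line, and the [A-Za-z_]+ name pattern are assumptions for illustration, not part of the original log.

```shell
# Made-up sample: two real blocks plus a noise line that mentions "data".
log=$(mktemp)
cat > "$log" <<'EOF'
MPRINT: data xxxxx;
yyyyy
MPRINT: run;
updating data; done
data fff;
fffff
run;
EOF

# Loose start pattern: the noise line also opens a range.
loose=$(sed -n '/data.*;/,/run;/p' "$log")

# Stricter start pattern (extended syntax): "data", a space, a name, then ";".
tight=$(sed -n -E '/data [A-Za-z_]+;/,/run;/p' "$log")

printf '%s\n' "$tight"
rm -f "$log"
```

With the loose pattern the line "updating data; done" starts a range of its own and is printed; the stricter pattern skips it and prints only the two blocks.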

0

Just try (?s)data.+?run;

Explanation:

(?s) - single line mode, . matches newline character

data - match data literally

.+? - match one or more of any character (including newline), non-greedy due to ?

run; - match run; literally


Michał Turczyn