
I have the following text:

aaa rr tt zz pp
aaa pp xx yy uu zz

I need to extract every occurrence of 'aaa', 'zz' and 'xx' and print the matches from each input line on a single line, like this:

aaa zz
aaa xx zz

The best I have found is grep -oP 'aaa|xx|zz', but this prints each match on its own line:

aaa
zz
aaa
xx
zz

I tried adding something like tr -d '\n', but that puts all the matches on a single line, which is not what I want.

NB: I need a solution that supports non-greedy regular expressions, as the real search patterns look like: ^.+?,|,IN:.+?\-|,OUT:.+?-|State.+?[$,]

Jason Aller
xtruder99

2 Answers


Assuming you have grep -P, here is a simple Awk postprocessor to rearrange the output into the desired format.

grep -Pno '^.+?,|,IN:.+?\-|,OUT:.+?-|State.+?[$,]' - /dev/null <file |
awk 'BEGIN { re="^\\(standard input\\):[1-9][0-9]*:" }
    $0 ~ re { sep="\n"; sub(re, "") }
    { if(NR>1) printf "%s", sep; printf "%s", $0; sep=" " }
    END { if(sep) printf "\n" }'

If an actual match could itself begin with text that looks like the (standard input):1: prefix, this won't work.

This is from BSD grep; if your local grep outputs a differently formatted file name prefix for standard input (or if you need to refactor to read a number of named files instead of standard input), the Awk regex will need to be adapted accordingly.
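If your grep repeats the line-number prefix on every match (GNU grep appears to do this with -n -o, but check locally), the matches can be grouped by that number instead of by the presence of a prefix. The following is only a sketch under that assumption: the function name group_by_line_number is made up for illustration, and it expects input of the form N:match, which is what grep -no prints when reading a single unnamed input.

```shell
# group_by_line_number: read "N:match" lines on stdin (the assumed shape of
# `grep -no` output) and print the matches that share a line number N on one
# output line, separated by single spaces.
group_by_line_number() {
  awk '{
    n = $0; sub(/:.*/, "", n)        # line number: text before the first colon
    m = $0; sub(/^[0-9]+:/, "", m)   # the match itself (may contain colons)
    if (NR > 1) printf "%s", (n == prev ? " " : "\n")
    printf "%s", m
    prev = n
  }
  END { if (NR) printf "\n" }'
}

# Intended use (requires a grep with -P):
#   grep -noP 'aaa|xx|zz' <file | group_by_line_number
```

Because the grouping key is the line number, a match that happens to contain a colon is left intact; only the leading digits and the first colon are stripped.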

tripleee
  • Thanks Tripleee, I tried on several machines but could not make it work. However, the idea of adding -n to grep was very good, so I made my own version: awk -F':' '{curr = $1; sub("^[1-9][0-9]*:", ""); if(prev && prev!=curr) {print STR; STR=$0;} else {if (!prev) STR=$0; else STR=STR " " $0;} prev=curr;} END {print STR}' – xtruder99 Jan 29 '20 at 11:27

You may use

 while IFS= read -r line; do
   echo $(grep -oP 'aaa|xx|zz' <<< "$line");
 done < file

That is,

  1. Read input file line by line
  2. Get your matches with the grep command per each line
  3. The shell replaces the newlines with spaces because the $(...) is not enclosed in double quotes.
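The effect described in step 3 can be seen in isolation. This small sketch assumes the default IFS and uses a made-up variable, matches, to stand in for the grep output:

```shell
# Stand-in for grep -o output: three matches, one per line.
matches=$(printf '%s\n' aaa xx zz)

echo $matches      # unquoted: the shell splits on whitespace, echo rejoins with spaces
echo "$matches"    # quoted: the newlines between matches are preserved
```

The unquoted form also undergoes pathname expansion, so it is only safe when the matches cannot contain wildcard characters such as * or ?.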

If you have specific whitespace inside matches that you want to preserve, consider using

while IFS= read -r line; do 
  echo "$(grep -oP 'aaa|xx|zz' <<< "$line" | awk '{ printf "%s", $0" "}')"; 
done < file

This way, you will get per-line matches in a space-separated way. You may use any custom delimiter in the awk command (after $0).
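The loop above starts one grep process per input line, which can be slow on large files. If the pattern can be written as a POSIX extended regular expression (awk has no lazy quantifiers, so this does not cover the non-greedy patterns as written), a single awk process can collect the matches. This is a sketch only, and the function name matches_per_line is hypothetical:

```shell
# matches_per_line ERE: read text on stdin and print the matches found on each
# input line joined by single spaces; lines without any match are skipped.
matches_per_line() {
  awk -v re="$1" '{
    out = ""; rest = $0
    while (match(rest, re)) {
      if (RLENGTH == 0) break                  # guard against empty matches
      out = out (out == "" ? "" : " ") substr(rest, RSTART, RLENGTH)
      rest = substr(rest, RSTART + RLENGTH)    # keep scanning after this match
    }
    if (out != "") print out
  }'
}

printf '%s\n' 'aaa rr tt zz pp' 'aaa pp xx yy uu zz' | matches_per_line 'aaa|xx|zz'
```

Two caveats: awk -v interprets backslash escapes in the pattern, and because the line is re-scanned from the end of each match, a ^ anchor would match again at the start of every remainder, so this suits unanchored alternatives only.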

Wiktor Stribiżew
  • You don't need the Awk if you use the (then not) useless `echo`, or vice versa. – tripleee Jan 29 '20 at 10:06
  • @tripleee If you could provide a variation of the command you mean it would be helpful. I tried without, but this was the only one that worked. – Wiktor Stribiżew Jan 29 '20 at 10:08
  • `while IFS= read -r line; do echo $(grep -oP 'aaa|xx|zz' <<<"$line"); done <<<$'aaa bb cc\nzz xx yy\nboo baa'` – tripleee Jan 29 '20 at 10:11
  • Well not exactly; the *shell* will flatten whitespace when you pass an unquoted string such as the result of a command substitution. Equivalently `echo "$(grep -oP 'aaa|xx|zz' | awk '{ printf "%s", $0 }')"` would work, and properly quote the output from the shell. This is the better approach if the output from `grep` could contain irregular whitespace and/or unquoted wildcards. See also https://stackoverflow.com/questions/10067266/when-to-wrap-quotes-around-a-shell-variable – tripleee Jan 29 '20 at 10:32
  • @tripleee Yes, OP sample input is too simplified – Wiktor Stribiżew Jan 29 '20 at 10:36
  • Thanks, this works fine but it is extremely slow (when using a complex regexp) – xtruder99 Jan 29 '20 at 11:25
  • @xtruder99 That might be a problem with the regex. Your `^.+?,|,IN:.+?\-|,OUT:.+?-|State.+?[$,]` is very inefficient and, I am afraid, wrong. What do you want to match with `[$,]`? Do you realize it matches `$` or `,` symbols? If you mean end of string or `,` you need `^[^,]*,|,IN:[^-]*-|,OUT:[^-]*|State[^,]*,`. Also, no need to use `P`, use `E` with this pattern, as there are no PCRE-specific constructs any longer. If the input text is long, a lazy dot pattern may be really dangerous. – Wiktor Stribiżew Jan 29 '20 at 11:32
  • @WiktorStribiżew well, your arguments on the pertinence of my regexp are not receivable, but anyway I have done more tests and indeed the performance is terrible on Cygwin (around 3 lines per second) but the slowdown is not really noticeable on native Linux. So I'll accept that one :) – xtruder99 Jan 29 '20 at 14:00