How to avoid newlines between grep -o matches that come from the same line (in a multi-line text)

Question

I have the following text:

aaa rr tt zz pp
aaa pp xx yy uu zz

I need to extract every occurrence of the patterns 'aaa', 'zz' and 'xx' and print the matches from each input line on one output line, like this:

aaa zz
aaa xx zz

The best I found is grep -oP 'aaa|xx|zz', but this returns each match on a new line:

aaa
zz
aaa
xx
zz

I tried adding something like tr -d '\n', but then all the matches from the whole file end up on a single line, which is not what I want.

NB: I need a solution that supports non-greedy regexps, since the real search patterns look like: ^.+?,|,IN:.+?\-|,OUT:.+?-|State.+?[$,]

You may be able to do this simply with: `awk -F= '{ printf "%s", $0" "}' file` — Zlemini, Jul 19 '23 at 07:49

score 1 · Answer 1 · answered Jan 29 '20 at 09:35

Assuming you have grep -P, here is a simple Awk postprocessor to rearrange the output into the desired format.

grep -Pno '^.+?,|,IN:.+?\-|,OUT:.+?-|State.+?[$,]' - /dev/null <file |
awk 'BEGIN { re="^\\(standard input\\):[1-9][0-9]*:" }
    $0 ~ re { sep="\n"; sub(re, "") }
    { if(NR>1) printf "%s", sep; printf "%s", $0; sep=" " }
    END { if(sep) printf "\n" }'

If the grep results could accidentally output a prefix which looks like (standard input):1: from an actual match, this won't work.

This is from BSD grep; if your local grep outputs a differently formatted file name prefix for standard input (or if you need to refactor to read a number of named files instead of standard input), the Awk regex will need to be adapted accordingly.
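For comparison, here is a minimal sketch of the same post-processing idea under a different assumption: GNU grep reading a single named file, where -n -o prefixes every match with LINE: and prints no file name. The function name join_matches and its two parameters are made up for illustration. grep runs only once over the whole file, and input lines without any match produce no output line.

```shell
# Sketch only: assumes GNU grep with PCRE support (-P) and a single named file,
# so every -o match is printed as "LINE:match".
join_matches() {
    # $1 = PCRE pattern, $2 = input file (hypothetical parameter names)
    grep -Pno -- "$1" "$2" |
    awk '{
        n = $0; sub(/:.*/, "", n)        # line number: text before the first colon
        m = $0; sub(/^[0-9]+:/, "", m)   # the match itself, prefix removed
        if (NR > 1) printf "%s", (n == prev ? " " : "\n")
        printf "%s", m
        prev = n
    }
    END { if (NR) printf "\n" }'
}

# Example call, with "file" holding the sample text from the question:
# join_matches 'aaa|xx|zz' file
```

Matches that themselves contain colons are safe here, because only the leading digits and the first colon are stripped.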

Thanks Tripleee, I tried several machines but could not make it work. However, the idea of adding -n to grep was very good, so I made my own version: awk -F':' '{curr = $1; sub("^[1-9][0-9]*:", ""); if(prev && prev!=curr) {print STR; STR=$0;} else {if (!prev) STR=$0; else STR=STR " " $0;} prev=curr;} END {print STR}' – xtruder99, Jan 29 '20 at 11:27

Wiktor Stribiżew · Accepted Answer · 2020-01-29T10:41:54.933

1

You may use

 while IFS= read -r line; do
   echo $(grep -oP 'aaa|xx|zz' <<< "$line");
 done < file

That is,

Read the input file line by line.
Get the matches for each line with the grep command.
The shell replaces the newlines with spaces because the $(...) is not enclosed in double quotes.

If you have specific whitespace inside matches that you want to preserve, consider using
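The effect of leaving $(...) unquoted can be seen in isolation with a small sketch. The variable names matches, unquoted and quoted are illustrative, and printf stands in for the grep -o output.

```shell
# Three matches, one per line, as grep -o would print them.
matches=$(printf '%s\n' aaa xx zz)

# Unquoted: the shell splits the value on whitespace (newlines included),
# and echo joins the resulting words with single spaces.
unquoted=$(echo $matches)

# Quoted: the value is passed as one argument, so the newlines survive.
quoted=$(echo "$matches")

printf '%s\n' "$unquoted"   # aaa xx zz
printf '%s\n' "$quoted"     # aaa, xx and zz on three separate lines
```

The unquoted form also applies pathname expansion to the matches, so a match such as * would be replaced by file names from the current directory.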

while IFS= read -r line; do 
  echo "$(grep -oP 'aaa|xx|zz' <<< "$line" | awk '{ printf "%s", $0" "}')"; 
done < file

This way, you will get per-line matches in a space-separated way. You may use any custom delimiter in the awk command (after $0).
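As a sketch of the custom-delimiter idea, the awk step can be wrapped in a small helper that takes the delimiter as an argument and avoids the trailing delimiter. The name join_with is made up for illustration, and the helper assumes the delimiter contains no backslashes, since awk -v interprets escape sequences.

```shell
# Sketch only: join the lines read from standard input with a chosen delimiter.
join_with() {
    # $1 = delimiter to place between matches (hypothetical parameter)
    awk -v d="$1" '{ printf "%s%s", (NR > 1 ? d : ""), $0 }
                   END { if (NR) printf "\n" }'
}

printf '%s\n' aaa xx zz | join_with ', '    # aaa, xx, zz

# Inside the loop from the answer (bash here-string):
# while IFS= read -r line; do
#   grep -oP 'aaa|xx|zz' <<< "$line" | join_with ' '
# done < file
```

Unlike the echo variant, this prints nothing at all for an input line that has no matches.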

edited Jan 29 '20 at 10:41

answered Jan 29 '20 at 09:45


You don't need the Awk if you use the (then not) useless `echo`, or vice versa. – tripleee Jan 29 '20 at 10:06
@tripleee If you could provide a variation of the command you mean it would be helpful. I tried without, but this was the only one that worked. – Wiktor Stribiżew Jan 29 '20 at 10:08
`while IFS= read -r line; do echo $(grep -oP 'aaa|xx|zz' <<<"$line"); done <<<$'aaa bb cc\nzz xx yy\nboo baa'` – tripleee Jan 29 '20 at 10:11
Well not exactly; the *shell* will flatten whitespace when you pass an unquoted string such as the result of a command substitution. Equivalently `echo "$(grep -oP 'aaa|xx|zz' | awk '{ printf "%s", $0 }')"` would work, and properly quote the output from the shell. This is the better approach if the output from `grep` could contain irregular whitespace and/or unquoted wildcards. See also https://stackoverflow.com/questions/10067266/when-to-wrap-quotes-around-a-shell-variable – tripleee Jan 29 '20 at 10:32
@tripleee Yes, OP sample input is too simplified – Wiktor Stribiżew Jan 29 '20 at 10:36
Thanks, this works very fine but it is extremely slow (when using complex regexp) – xtruder99 Jan 29 '20 at 11:25
@xtruder99 That might be a problem with the regex. Your `^.+?,|,IN:.+?\-|,OUT:.+?-|State.+?[$,]` is very inefficient, and I am afraid wrong. What do you want to match with `[$,]`? Do you realize it matches `$` or `,` symbols? If you mean end of string or `,` you need `^[^,]*,|,IN:[^-]*-|,OUT:[^-]*|State[^,]*,`. Also, no need to use `P`, use `E` with this pattern, as there are no PCRE-specific constructs any longer. If the input text is long, a lazy dot pattern may be a really dangerous pattern. – Wiktor Stribiżew Jan 29 '20 at 11:32
@WiktorStribiżew well, your arguments about the pertinence of my regexp are not admissible, but anyway I have done more tests and indeed the performance is terrible on Cygwin (around 3 lines/second) but not really noticeable on native Linux. So I'll accept this one :) – xtruder99 Jan 29 '20 at 14:00
