
I am new to shell scripting, and it would be great if I could get some help with the question below.

I want to read a text file line by line, and print all matched patterns in that line to a line in a new text file.

For example:

$ cat input.txt

SYSTEM ERROR: EU-1C0A  Report error -- SYSTEM ERROR: TM-0401 DEFAULT Test error
SYSTEM ERROR: MG-7688 DEFAULT error -- SYSTEM ERROR: DN-0A00 Error while getting object -- ERROR: DN-0A52 DEFAULT Error -- ERROR: MG-3218 error occured in HSSL
SYSTEM ERROR: DN-0A00 Error while getting object -- ERROR: DN-0A52 DEFAULT Error
SYSTEM ERROR: EU-1C0A  error Failed to fill in test report -- ERROR: MG-7688

The intended output is as follows:

$ cat output.txt

EU-1C0A TM-0401
MG-7688 DN-0A00 DN-0A52 MG-3218
DN-0A00 DN-0A52
EU-1C0A MG-7688

I tried the following code:

while read p; do
    grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' | xargs
done < input.txt > output.txt

which produced this output:

EU-1C0A TM-0401 MG-7688 DN-0A00 DN-0A52 MG-3218 DN-0A00 DN-0A52 EU-1C0A MG-7688 .......

Then I also tried this:

while read p; do
    grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' | xargs > output.txt
done < input.txt

But that did not help :(

Maybe there is another way, I am open to awk/sed/cut or whatever... :)

Note: A single line can contain any number of error codes (i.e. XX-XXXX, the pattern of interest).

jww
Dinesh Kumar
  • Any time you're considering using a shell loop to manipulate text, read [why-is-using-a-shell-loop-to-process-text-considered-bad-practice](http://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice) to understand some, but not all, of the reasons not to do that. – Ed Morton Dec 11 '16 at 17:39

8 Answers

% awk 'BEGIN{RS=": "};NR>1{printf "%s%s", $1, ($0~/\n/)?"\n":" "}' input.txt 
EU-1C0A TM-0401
MG-7688 DN-0A00 DN-0A52 MG-3218
DN-0A00 DN-0A52
EU-1C0A MG-7688

Explanation in longform:

awk '
    BEGIN{ RS=": " } # Set the record separator to colon-space
    NR>1 {           # Ignore the first record
        printf("%s%s", # Print two strings:
            $1,      # 1. first field of the record (`$1`)
            ($0~/\n/) ? "\n" : " ")
                     # Ternary expression, read as: if the condition (the
                     # thing in parentheses) holds, use the thing after `?`,
                     # otherwise use the thing after `:`.
                     # So: if the record ($0) matches (`~`) a newline (`\n`),
                     # print a newline; otherwise, print a space.
    }
' input.txt 

Previous answer to the unmodified question:

% awk 'BEGIN{RS=": "};NR>1{printf "%s%s", $1, (NR%2==1)?"\n":" "}' input.txt 
EU-1C0A TM-0401
MG-7688 MG-3218
DN-0A00 DN-0A52
EU-1C0A MG-7688

Edit: with a safeguard against :-injection (thanks @e0k). It tests that the first field after the record separator looks the way we expect it to.

awk 'BEGIN{RS=": "};NR>1 && $1 ~ /^[A-Z]{2}-[A-Z0-9]{4}$/ {printf "%s%s", $1, ($0~/\n/)?"\n":" "}' input.txt
joepd
  • GNU `awk` accepts that as written; BSD (macOS) `awk` wants `(NR%2==1)` with the parentheses added (I'm not entirely sure why, but it yields a syntax error with the parentheses missing). This then works with both. – Jonathan Leffler Dec 09 '16 at 19:38
  • Hi, joepd, I am interested in your awk solution. – Dinesh Kumar Dec 09 '16 at 19:40
  • However I might have any number of Error codes (I.e. XX-XXXX) in a line, will your suggestion still work in that case ? – Dinesh Kumar Dec 09 '16 at 19:42
  • This solution will only work for this exact format of output. If there were more error codes it wouldn't catch them. – Stats4224 Dec 09 '16 at 19:54
  • Thanks Stats4224, I have edited my question with a note – Dinesh Kumar Dec 09 '16 at 20:01
  • Hi Joepd, would you be so kind as to elaborate a bit more on your last edit? I would like to learn how it works :) – Dinesh Kumar Dec 09 '16 at 20:28
  • This solution exploits how each error code in the example is preceded by a `: `. If this string `: ` appears for any other reason, it would print something other than an error code (a false positive). There is no attempt to match the error code to a regular expression. – e0k Dec 09 '16 at 20:47
  • This will still only catch these error codes if they are of the format: ` : : ` but never: `: ` (the third code will be skipped). That said this has some really sweet awk nuggets of knowledge in it. – Stats4224 Dec 09 '16 at 21:04
  • @JonathanLeffler no, this will not work in BSD awk, it just appears to with a given sample input. The reason for that is the multi-char RS value is only supported by gawk. BSD awk will strip the blank char and just use `:`. That's fine if you really just want `:` to be the RS but wrong if you really want `: ` as written. – Ed Morton Dec 11 '16 at 17:25
  • @joepd, the following command worked: awk 'BEGIN{RS=": "};NR>1{printf "%s%s", $1, (NR%2==1)?"\n":" "}' input.txt but, as predicted by e0k, it was not robust and also printed other text that fit the ": " pattern. Then I tried your last solution with the built-in safeguard; unfortunately this one did not extract any error codes at all :(. The command ran successfully and no errors were printed. Any idea? – Dinesh Kumar Dec 12 '16 at 13:38
  • Ha, you might want to set `RS` to `ERROR: `. It seems like the error messages from your example always start with this. The command would become: `awk 'BEGIN{RS="ERROR: "};NR>1{printf "%s%s", $1, ($0~/\n/)?"\n":" "}' input.txt` – joepd Dec 12 '16 at 16:25
4

There's always perl! And this will grab any number of matches per line.

perl -nle '@matches = /[A-Z]{2}-[A-Z0-9]{4}/g; print(join(" ", @matches)) if (scalar @matches);' input.txt > output.txt

-e supplies the Perl code on the command line, -n runs that code once per input line, and -l automatically chomps each line and appends a newline to every print.

The regex implicitly matches against $_, so writing @matches = $_ =~ //g would be overly verbose.

If there is no match, this will not print anything.

Stats4224

You could always keep it extremely simple:

$ awk '{o=""; for (i=1;i<=NF;i++) if ($i=="ERROR:") o=o$(i+1)" "; print o}' input.txt
EU-1C0A TM-0401
MG-7688 DN-0A00 DN-0A52 MG-3218
DN-0A00 DN-0A52
EU-1C0A MG-7688

The above will add a blank char to the end of each line, trivially avoided if you care...
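
One way to avoid that trailing blank is to emit the separator in front of every code except the first. A minimal sketch under the same assumption that each code follows an `ERROR:` field; the `sep` variable, the `no_trailing` variable and the temporary sample file are illustrative and not part of the one-liner above:

```shell
# Build a small sample file in a temporary directory (illustrative only).
tmp=$(mktemp -d)
cat > "$tmp/input.txt" <<'EOF'
SYSTEM ERROR: EU-1C0A  Report error -- SYSTEM ERROR: TM-0401 DEFAULT Test error
SYSTEM ERROR: DN-0A00 Error while getting object -- ERROR: DN-0A52 DEFAULT Error
EOF

# sep is empty before the first code on a line and a blank afterwards,
# so nothing is appended after the last code.
no_trailing=$(awk '{
    o = ""; sep = ""
    for (i = 1; i < NF; i++)
        if ($i == "ERROR:") { o = o sep $(i+1); sep = " " }
    print o
}' "$tmp/input.txt")

printf '%s\n' "$no_trailing"
rm -rf "$tmp"
```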

Ed Morton
  • Thanks Ed, it's a simple and elegant solution, and fortunately in my case the error codes are always preceded by "ERROR:" – Dinesh Kumar Dec 12 '16 at 13:30

To keep your grep pattern, here's a way:

while IFS='' read -r p; do
    echo $(grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' <<<"$p")
done < input.txt > output.txt
  • while IFS='' read -r p; do is the standard way to read line-by-line into a variable. See, e.g., this answer.
  • grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' <<<"$p" runs your grep and prints the matches. The <<<"$p" is a "here string" that provides the string $p (the line that was read in) as stdin to grep. This means grep will search the contents of $p and print each match on its own line.
  • echo $(grep ...) converts the newlines in grep's output to spaces, and adds a newline at the end. Since this loop happens for each line, the result is to print each input line's matches on a single line of the output.
  • done < input.txt > output.txt is correct: you are providing input to, and taking output from, the loop as a whole. You don't need redirection within the loop.
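
The newline-to-space conversion described above comes from the shell's word splitting of the unquoted `$(...)`, not from `echo` itself. A small sketch of the difference, with the grep output simulated by `printf` (the variable names are illustrative):

```shell
# grep -o emits one match per line; simulate that output here.
matches=$(printf '%s\n' EU-1C0A TM-0401)

# Quoted: the embedded newline is preserved, so two lines come out.
quoted=$(echo "$matches")

# Unquoted: the shell splits the value on whitespace (default IFS) and
# echo joins the resulting words with single spaces.
unquoted=$(echo $matches)

printf 'quoted:\n%s\n' "$quoted"
printf 'unquoted: %s\n' "$unquoted"
```

The unquoted form is also subject to filename expansion, which is harmless for codes like these but worth keeping in mind if the pattern could ever match `*` or `?`.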
cxw
  • @DineshKumar Sorry to hear that! It worked on my cygwin installation. What did it do when you tried? This is definitely bash, not sh, which may make a difference. – cxw Dec 12 '16 at 14:57

Another solution that works if you know that every line will contain exactly two instances of the strings you want to match:

grep -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' input.txt | xargs -L2 > output.txt
VHarisop

Here is a solution with awk that is fairly straightforward, but it is not an elegant one-liner (as many awk solutions tend to be). It should work with any number of your error codes per line, and with an error code defined as a field (white space separated word) that matches a given regex. Since it's not a snazzy one-liner, I stored the program in a file:

codes.awk

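#!/usr/bin/awk -f
{
    m = 0                                 # matches printed so far on this line
    for (i = 1; i <= NF; ++i) {
        if ($i ~ /^[A-Z][A-Z]-[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]$/) {
            if (m > 0) printf "%s", OFS   # separator only between matches
            printf "%s", $i               # never pass data as the format string
            m++
        }
    }
    if (m > 0) printf "%s", ORS           # newline only if something matched
}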

You would run this like

$ awk -f codes.awk input.txt

I hope you find it fairly easy to read. It runs the block once for each line of input. It iterates over each field and checks if it matches a regular expression, then prints the field if it does. The variable m keeps track of the number of matched fields on the current line so far. The purpose of this is to print the output field separator OFS (a space by default) between the matched fields only as needed and to use the output record separator ORS (a newline by default) only if there was at least one error code found. This prevents unnecessary white space.

Notice that I have changed your regular expression from [A-Z]{2}-[A-Z0-9]{4} to [A-Z][A-Z]-[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]. This is because old awk will not (or at least may not) support interval expressions (the {n} parts). You could use [A-Z]{2}-[A-Z0-9]{4} with gawk, however. You can tweak the regex as needed. (In both awk and gawk, regular expressions are delimited by /.)

The regex /[A-Z]{2}-[A-Z0-9]{4}/ would match any field that contains your XX-XXXX pattern of letters and digits. You want the field to be a full match to the regex and not just include something that matches that pattern. To do this, ^ and $ mark the beginning and end of the string. For example, /^[A-Z]{2}-[A-Z0-9]{4}$/ (with gawk) would match US-BOTZ, but not USA-ROBOTS. Without the ^ and $, USA-ROBOTS would match because it includes the substring SA-ROBO, which does match the regex.
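
The effect of the anchors can be checked quickly from the shell. This is only a sketch: it uses `grep -E` rather than awk so that the `{n}` intervals work regardless of the awk installed, and the `check_code` helper is a hypothetical name introduced for illustration:

```shell
# Report whether a word matches the pattern with and without ^...$ anchors.
check_code() {
    word=$1
    anchored=no
    unanchored=no
    # Anchored: the whole word must be exactly XX-XXXX.
    printf '%s\n' "$word" | grep -Eq '^[A-Z]{2}-[A-Z0-9]{4}$' && anchored=yes
    # Unanchored: any XX-XXXX substring is enough.
    printf '%s\n' "$word" | grep -Eq '[A-Z]{2}-[A-Z0-9]{4}' && unanchored=yes
    printf '%s: anchored=%s unanchored=%s\n' "$word" "$anchored" "$unanchored"
}

check_code US-BOTZ       # matches both ways
check_code USA-ROBOTS    # matches only without the anchors (via SA-ROBO)
```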

e0k

Parsing grep -n with AWK

grep -n -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' file | awk -F: -vi=0 '{
  printf("%s%s", i ? (i == $1 ? " " : "\n") : "", $2)
  i = $1
}
END { if (NR) print "" }  # terminate the last output line'

The idea is to join the lines from the output of grep -n:

1:EU-1C0A
1:TM-0401
2:MG-7688
2:DN-0A00
2:DN-0A52
2:MG-3218
3:DN-0A00
3:DN-0A52
4:EU-1C0A
4:MG-7688

by the line numbers. AWK initializes the field separator (-F:) and the i variable (-vi=0), then processes the output of the grep command line by line.

It prints a character depending on a conditional expression that tests the value of the first field $1. If i is zero (the first iteration), it prints only the second field $2. Otherwise, if the first field equals i, it prints a space, else a newline ("\n"). After the space/newline, the second field is printed.

After printing the next chunk, the value of the first field is stored into i for the next iterations (lines): i = $1.

Perl

Parsing grep -n in Perl

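use strict;
use warnings;

my $p = 0;  # line number of the previous match (0 = none yet)

while (<>) {
  /^(\d+):(.*)$/ or next;             # skip anything not of the form N:match
  print $p == $1 ? " " : "\n" if $p;  # same input line: space, else newline
  print $2;
  $p = $1;
}
print "\n" if $p;                     # terminate the last output line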

Usage: grep -n -o '[A-Z]\{2\}-[A-Z0-9]\{4\}' file | perl script.pl.

Single Line

But Perl is actually so flexible and powerful that you can solve the problem completely with a single line:

perl -lne 'print @_ if @_ = /([A-Z]{2}-[A-Z\d]{4})/g' < file

I've seen a similar solution in one of the answers here. Still I decided to post it as it is more compact.

One of the key ideas is using the -l switch that

  1. automatically chomps the input record separator $/;
  2. assigns the output record separator $\ to have the value of $/ (which is newline by default)

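The value of the output record separator, if defined, is printed after the last argument passed to print. As a result, the script prints all matches (the contents of @_) followed by a newline. Note that print @_ writes the matches back to back with nothing between them; to separate them with spaces, print "@_" or join(" ", @_) instead.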

The @_ variable is usually used as an array of subroutine parameters. I have used it in the script only for the sake of shortness.

Ruslan Osmanov
  • This worked perfectly; I only needed a small modification to add the spaces: perl -lne 'print(join(" ", @_)) if @_ = /([A-Z]{2}-[A-Z\d]{4})/g' < input.txt The power of perl indeed :) – Dinesh Kumar Dec 12 '16 at 13:56

In GNU awk, with support for multiple matches on each record:

$ awk '
{
    while(match($0, /[A-Z]{2}-[A-Z0-9]{4}/)) {  # find first match on record
        b=b substr($0,RSTART,RLENGTH) OFS       # buffer the match
        $0=substr($0,RSTART+RLENGTH)            # truncate from start of record
    }
    if(b!="") print b                           # print buffer if not empty
    b=""                                        # empty buffer
}' file
EU-1C0A TM-0401 
MG-7688 DN-0A00 DN-0A52 MG-3218 
DN-0A00 DN-0A52 
EU-1C0A MG-7688 

Downside: there will be an extra OFS at the end of each printed record.

If you want to use an awk other than GNU awk, replace the regex match with:

while(match($0, /[A-Z][A-Z]-[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]/))
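
Putting the interval-free regex together with a separator variable gives a variant that should run on awks without `{n}` support and leaves no trailing OFS. This is a sketch rather than the answer's original program; the `sep` and `portable` names and the two sample lines are illustrative:

```shell
portable=$(printf '%s\n' \
    'SYSTEM ERROR: EU-1C0A  Report error -- SYSTEM ERROR: TM-0401 DEFAULT Test error' \
    'SYSTEM ERROR: EU-1C0A  error Failed to fill in test report -- ERROR: MG-7688' |
awk '{
    b = ""; sep = ""
    while (match($0, /[A-Z][A-Z]-[A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]/)) {
        b = b sep substr($0, RSTART, RLENGTH)   # append the match
        sep = OFS                               # later matches get a separator in front
        $0 = substr($0, RSTART + RLENGTH)       # continue searching after the match
    }
    if (b != "") print b                        # skip lines without any code
}')

printf '%s\n' "$portable"
```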
James Brown