
Maybe I'm too old for perl/awk/sed, too young to stop programming. Here is the problem I need to solve:

I have info like this in a TXT file:

Name:
Name 1
Phone:
1111111
Email:
some@email1
DoentMatterInfo1:
whatever1
=
Name:
Name 2
Phone:
22222222
DoentMatterInfo2:
whatever2
Email:
some@email2
=
Name:
Name 3
DoentMatterInfo3:
whatever2
Email:
some@email3
=

Please note that the desired info is on the next line, there is a record separator (=) and, very important, some records don't have all the info, but could have info that we don't want.

So, the challenge is to extract the desired info, if it exists, in an output like:

Name 1 ; 1111111 ; some@email1
Name 2 ; 22222222 ; some@email2
Name 3 ; ; some@email3

Here is what I have tried; it worked a little bit but still is not what I'm looking for.

1. Using Perl

Using Perl I got the fields that matter:

while (<>) {
    if ($_ =~ /Name/) {
        print "=\n" . scalar <>;   # record separator, then the line after "Name:"
    }
    if ($_ =~ /Email/) {
        print "; " . scalar <>;    # the line after "Email:"
    }
    if ($_ =~ /Phone/) {
        print "; " . scalar <>;    # the line after "Phone:"
    }
}

Then I got a file like:

Name 1
; 1111111
; some@email1
=
Name 2
; 22222222
; some@email2
=
Name:
Name 3
; some@email3
=

Now with sed I put each record in a single line:

2. Using sed

With sed, this command removes the line feeds, joining all the info into a single line:

sed ':a;N;$!ba;s/\n//g' input.txt > out1.txt

And then to put back the line feeds:

sed 's/=/\n/g' out1.txt > out2.txt

So I got a file with the info for each record on its own line:

Name 1 ; 1111111 ; some@email1
Name 2 ; 22222222 ; some@email2
Name 3 ; some@email3

Still not what I would like to get from the code. I want something better, like being able to fill the missing phone with a space, so the second column is always the phone column. Do you get it?

As you can see, the point is to find a solution, no matter if it uses Perl, awk or sed. I'm trying Perl hashes...

Thanks in advance!!

  • Thank you for updating the example in your question. I've reformatted the sample input/output and code in your question 3 times so far; if you need to make any other changes to it after this please just use the same formatting I'm using, where the blocks are simply indented 4 spaces. See https://stackoverflow.com/editing-help. – Ed Morton May 22 '20 at 22:48
  • I'm trying Perl to programmatically list that rubbish. So, Perl explores the file and tries to put the info in a hash of hashes... but it doesn't work fine yet, because I'm not getting the correct key... please take a look: – Luis Cáceres May 22 '20 at 22:56
  • $lead =0; while (<>) { if ($_ =~ /Name/) { #print "=\n". scalar <>; $leads{$lead}{Name}=scalar <>; } if ($_ =~ /Email/) { #print "; ". scalar <>; $leads{$lead}{Email}=scalar <>; } if ($_ =~ /Phone/) { #print "; ". scalar <>; $leads{$lead}{Phone}=scalar <>; } $lead++; } for(keys %leads){ print("Email de $_ is $leads{$_}{Email}\n"); print("Nane de $_ is $leads{$_}{Name}\n"); print("Phone de $_ is $leads{$_}{Phone}\n"); } – Luis Cáceres May 22 '20 at 22:57
  • In the 40 years I've been programming in UNIX I haven't yet personally come across a use for perl (for a large part of my career I've only had access to standard UNIX tools like awk for text processing) and so, unfortunately, I'm not familiar with its syntax. If you have a perl script you specifically want help with then post another question and just tag it with perl, but if you just want to know how to solve this problem then you certainly don't need perl to do it. – Ed Morton May 22 '20 at 22:58
  • Question for clarification. You have "fields" (in a relational database they would be called "columns") like **Name**, **Phone**, **Email** and others. Are you saying that you only want to extract name, phone and email, and disregard everything else? –  May 23 '20 at 00:57
  • More questions... Which of the following unusual situations are possible, and should they be handled (and how)? (1) File ends without an `=` terminating row. (2) The file has two consecutive `=` rows (meaning a record where all fields are `null`) - should those be reflected in the output? (3) A `Name:` row **not** followed by a name (but, for example, followed immediately by a `Phone:` row). This could perhaps be assumed to be equivalent to a record without a `Name:` label in the first place. (4) The last character of the file is not a newline. (5) Some of the "values" span over several lines. –  May 24 '20 at 00:04

4 Answers


Here is a Perl solution, since that is what was asked for and attempted.

use warnings;
use strict;
use feature 'say';

my @fields = qw(Name Phone Email);  # fields to process

my $re_fields = join '|', map { quotemeta } @fields;

my %record;

while (<>) { 
    if (/^\s*($re_fields):/) { 
        chomp($record{$1} = <>);
    }
    elsif (/^\s*=/) { 
        say join ';', map { $record{$_} // '' } @fields;
        %record = (); 
    }   
}

The input is prepared in the array @fields; this is the only place where those names are spelled out, so if more fields need to be added to the processing, just add them here. A regex pattern for matching any one of these fields is also prepared, in $re_fields.

Then we read line by line all files submitted on the command line, using the <> operator.

The if condition captures an expected keyword if it is there. In the body we read the next line for its value and store it with the captured keyword as the key (we need not know which one it was).

On a line starting with = the record is printed (correctly with the given sample file). I put nothing for missing fields (no spaces) and no extra spaces around ;. Adjust the output format as desired.
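
For instance, a minimal sketch of such an adjustment (using the same %record and @fields as above), with the " ; " separator from the question and an empty string where a field is missing:

say join ' ; ', map { $record{$_} // '' } @fields;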


In order to collect records throughout and process further (or just print) later, add them to a suitable data structure instead of printing. What storage to choose depends on what kind of processing is envisioned. The simplest way to go is to add strings for each output record to an array

my (@records, %record);

while (<>) {
    ...
    elsif (/^\s*=/) { 
        push @records, join ';', map { $record{$_} // '' } @fields;
        %record = (); 
    }   
}

Now @records has ready strings for all records, which can be printed simply as

say for @records;

But if more involved processing may be needed, then it is better to store copies of %record in an array as hash references, so that individual components can later be manipulated more easily

my (@records, %record);

while (<>) {
    ...
    elsif (/^\s*=/) { 
        # Add a key to the hash for any fields that are missing
        $record{$_} //= ''  for @fields;
        push @records, { %record };
        %record = (); 
    }   
}

I add a key for possibly missing fields, so that the hashrefs have all expected keys, and I assign an empty string to it. Another option is to assign undef.

Now you can access individual fields in each record as

foreach my $rec (@records) { 
    foreach my $fld (sort keys %$rec) {
        say "$fld -> $rec->{$fld}"
    }
}

or of course just print the whole thing using Data::Dumper or such.
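
For example, a minimal sketch with the core Data::Dumper module (assuming the @records of hashrefs built above):

use Data::Dumper;

print Dumper(\@records);   # dump the whole structure, one hashref per record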

zdim
  • Beautiful! That's why I love Perl! I tried this code quickly on tutorialspoint.com/execute_perl_online.php and it works great! Below I will show the not-so-elegant solution that I got. Actually quick and dirty. – Luis Cáceres May 25 '20 at 06:07
  • Dear @zdim, I'm having difficulties manipulating the final hash... sorry about that! Could you help to print the content in a loop (I know it's not elegant)...? – Luis Cáceres May 25 '20 at 18:01
  • @LuisCáceres I was going to add code for storing all records so that they can all be manipulated later, and there are various ways to do that ... so: how do you need to "manipulate" it? What do you want to do with it? – zdim May 25 '20 at 20:39
  • Well, so far, just print it in a single line per record. In my dirty way of solving it I considered adding a new field, ID, just as a counter. But that is because of my difficulty manipulating the hashes... – Luis Cáceres May 25 '20 at 21:26
  • @LuisCáceres Added two ways to store records for later processing. One makes it trivial to print (put ready record-lines on an array), the other is better if you need to manipulate individual fields, or add fields, etc (copy those hashes in each iteration into an array, as hashrefs). – zdim May 25 '20 at 22:57
  • Thank you @zdim! Definitely, Perl is for very wise minds. I'm pleased to see brilliant solutions, but I'm also concerned because today the average programmer is unable to get such ideas. Without mentioning that millennials would feel "offended" by the compiler messages! hehehe Thanks very much, I'm learning a lot! – Luis Cáceres May 26 '20 at 00:19

This will work using any awk in any shell on every UNIX box:

$ cat tst.awk
BEGIN { OFS=" ; " }
$0 == "=" {                          # end of a record: print the wanted fields
    print f["Name:"], f["Phone:"], f["Email:"]
    delete f                         # forget this record's values
    lineNr = 0
    next
}
++lineNr % 2 { tag = $0; next }      # odd lines within a record are the labels
{ f[tag] = $0 }                      # even lines are the values, keyed by the last label


$ awk -f tst.awk file
Name 1 ; 1111111 ; some@email1
Name 2 ; 22222222 ; some@email2
Name 3 ;  ; some@email3
Ed Morton

I would do it like this:

$ cat prog.awk

#!/bin/awk -f
BEGIN                    { OFS = ";" }
/^(Name|Phone|Email):$/  { getline arr[$0] ; next } 
/^=$/  { print arr["Name:"], arr["Phone:"], arr["Email:"] ; delete arr }

Explanation:

In the BEGIN block, define the output field separator (semicolon).

For each line in the input file, if the line (in its entirety) equals Name:, Phone: or Email:, assign the value of the following line to the element of the associative array arr keyed by that string. (That is how getline can be used to assign a value to a variable.) Then, with next, move on to the next input line, skipping the remaining rule.

If the line is =, print the three values from the arr associative array, and then clear out the array (delete all of its elements, so fields missing from the next record print as empty).

* * * *

Make it executable:

chmod +x prog.awk

Use it:

$ ./prog.awk file.txt 

Name 1;1111111;some@email1
Name 2;22222222;some@email2
Name 3;;some@email3

Note - a missing value is indicated by two consecutive semicolons (not by a space). Using space as placeholder for NULL is a common bad practice (especially in relational databases, but in flat files too). You can change this to use NULL as placeholder, I am not terribly interested in that bit of the problem.

  • Also a very nice solution, but on a gigabyte file I suspect @EdMorton's would have a slight edge on efficiency due to this answer's use of `getline`. I would have to time it to be sure. Both good solutions. – David C. Rankin May 23 '20 at 06:41
  • @DavidC.Rankin - if you care to see how this solution has evolved, you can look at the editing history (all the edits are mine). Originally I wasn't using `getline`. I wasn't aware that it is slow compared to other actions, is it? I thought I was improving my solution, but perhaps I made it worse instead! Another concern may be that assigning `getline` to a variable is supported by NAWK and GNU awk, but not, I believe, "old" awk. (An earlier version used standard `getline`, without assignment - that should work in all awk dialects.) –  May 23 '20 at 06:44
  • I think it is an improvement and a slick way to consume the next record. My only concern is that the additional I/O resource switching between the normal record retrieval and retrieval with `getline` would inject a bit of delay -- but I can't tell you that for certain -- which is why I said I would have to test it. It is a very slick solution. (also, it's just damn hard to outdo Ed on an `awk` solution. He and Charles breathe that stuff daily `:)` – David C. Rankin May 23 '20 at 06:52
  • The problem with using getline is more the fragility (that way of calling getline will silently duplicate previous success values on a getline failure and so can produce output lines that weren't present in the input) and difficulty of enhancing (try, for example adding a debugging statement to print every input line - without getline that's `{ print }`, with getline it's ugly, repetitive code). See http://awk.freeshell.org/AllAboutGetline. – Ed Morton May 23 '20 at 13:46
  • @mathguy never worry about anything old awk does - it is old and broken and must never be used by anyone any time. The only time I see it bite people these days is because it's the default awk (/bin/awk) on Solaris but there you also have /usr/xpg4/bin/awk (best choice as closest to POSIX compliance) and nawk (not POSIX as no character classes etc. but basically functional and far better than old, broken awk). – Ed Morton May 23 '20 at 14:00
  • I'd also recommend you don't use a shebang to call awk from a shell script. See https://stackoverflow.com/a/61002754/1745001. – Ed Morton May 23 '20 at 15:15

Input file format is easy to parse: split on =\n into records, split each record on \n into a hash, and push the hash into the @result array.

Then just output each element of the @result array, specifying the fields of interest.

use strict;
use warnings;
use feature 'say';

use Data::Dumper;

my @result;
my $data    = do { local $/; <DATA> };    # slurp the whole input at once
my @records = split('=\n?',$data);        # split into records on the "=" separator lines

# each record is alternating label/value lines: turn that list into a hash
push @result, {split "\n", $_} for @records;

say Dumper(\@result);

my @fields = qw/Name: Phone: Email:/;

for my $record (@result) {
    $record->{$_} = $record->{$_} || '' for @fields;   # default missing fields to ''
    say join('; ', @$record{@fields});                  # hash slice: the wanted fields in order
}

__DATA__
Name:
Name 1
Phone:
1111111
Email:
some@email1
DoentMatterInfo1:
whatever1
=
Name:
Name 2
Phone:
22222222
DoentMatterInfo2:
whatever2
Email:
some@email2
=
Name:
Name 3
DoentMatterInfo3:
whatever2
Email:
some@email3
=

Output

$VAR1 = [
          {
            'DoentMatterInfo1:' => 'whatever1',
            'Name:' => 'Name 1',
            'Email:' => 'some@email1',
            'Phone:' => '1111111'
          },
          {
            'Phone:' => '22222222',
            'Email:' => 'some@email2',
            'Name:' => 'Name 2',
            'DoentMatterInfo2:' => 'whatever2'
          },
          {
            'DoentMatterInfo3:' => 'whatever2',
            'Name:' => 'Name 3',
            'Email:' => 'some@email3'
          }
        ];

Name 1; 1111111; some@email1
Name 2; 22222222; some@email2
Name 3; ; some@email3
Polar Bear
  • Beautiful! That's why I love Perl! I tried this code quickly on https://www.tutorialspoint.com/execute_perl_online.php and it works great! Below I will show the not-so-elegant solution that I got. Actually quick and dirty. – Luis Cáceres May 25 '20 at 06:04
  • I wrote code using an IF for each field, quite dirty. I'm just curious (sorry for asking) how to manipulate the output var... – Luis Cáceres May 25 '20 at 07:14
  • @LuisCáceres - your question is not clear about the manipulation. Parsed data are stored in an array where each element is a hash (key => value) representing the data of one record. You can access data individually, for example as $result[2]->{'Name:'} for the third record's field `Name:`. You can loop over the `@result` array and output particular `fields` like `say join(';', @record{qw/field1 field2 .. fieldn/})`. You'd better look at [hash slices](https://www.webquills.net/web-development/perl/perl-5-hash-slices-can-replace.html) to get a better understanding (see the short hash-slice sketch after this thread). – Polar Bear May 25 '20 at 07:54
  • I really need a better understanding of hash slices. Actually I was snubbed by the compiler messages while trying to access the hash... It told me things like "There are better ways of doing that". I was afraid that in the next message the compiler would call me "you id10T" and spit on me! hehehehehe – Luis Cáceres May 25 '20 at 20:30
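
To illustrate the hash slices mentioned in the comment above, here is a minimal, self-contained sketch; the record data is made up for demonstration:

use strict;
use warnings;
use feature 'say';

# one parsed record, keyed the same way as in the answer (labels keep their colons)
my %record = ('Name:' => 'Name 1', 'Phone:' => '1111111', 'Email:' => 'some@email1');

# a hash slice pulls several values out at once, in the order of the given keys
say join('; ', @record{'Name:', 'Phone:', 'Email:'});   # Name 1; 1111111; some@email1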