
I would like to transpose a list of items (key/value pairs) into a table format. The solution can be a bash script, awk, sed, or some other method.

Suppose I have a long list, such as this:

date and time: 2013-02-21 18:18 PM
file size: 1283483 bytes
key1: value
key2: value

date and time: 2013-02-21 18:19 PM
file size: 1283493 bytes
key2: value

...

I would like to transpose this into a table format, using a tab or some other separator, so it looks like this:

date and time   file size   key1    key2
2013-02-21 18:18 PM 1283483 bytes   value   value
2013-02-21 18:19 PM 1283493 bytes       value
...

or like this:

date and time|file size|key1|key2
2013-02-21 18:18 PM|1283483 bytes|value|value
2013-02-21 18:19 PM|1283493 bytes||value
...

I have looked at solutions such as this one: An efficient way to transpose a file in Bash, but it seems I have a different case here. The awk solution there works only partially for me: it outputs all the rows as one long list of columns, but I need the columns to be constrained to a unique list.

awk -F': ' '
{ 
    for (i=1; i<=NF; i++)  {
        a[NR,i] = $i
    }
}
NF>p { p = NF }
END {    
    for(j=1; j<=p; j++) {
        str=a[1,j]
        for(i=2; i<=NR; i++){
            str=str" "a[i,j];
        }
        print str
    }
}' filename

UPDATE

Thanks to all of you who provided solutions. Some of them look very promising, but I think my versions of the tools might be outdated, as I am getting some syntax errors. What I am seeing now is that I did not start off with very clear requirements. Kudos to sputnick for being the first to offer a solution before I had spelled out the full requirements. I had had a long day when I wrote the question, so it was not very clear.

My goal is to come up with a very generic solution for parsing multiple lists of items into column format. I am thinking the solution does not need to support more than 255 columns. Column names will not be known ahead of time, so that the solution works for anyone, not just me. The two known things are the separator between key/value pairs (": ") and the separator between lists (an empty line). It would be nice to have variables for those, so that they are configurable for anyone reusing this.

From looking at the proposed solutions, I realize that a good approach is to do two passes over the input file: the first pass gathers all the column names (optionally sorting them) and prints the header; the second grabs the values of the columns and prints them.
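For example, here is a rough, untested sketch of that two-pass idea in plain awk (the filename is simply passed twice; the ": " pair separator, the blank-line record separator, and the "|" output separator are the ones described above and are easy to change):

awk -F': ' -v OFS='|' '
# First pass (NR==FNR): collect the unique key names in first-seen order.
NR == FNR {
    if (NF >= 2 && !($1 in seen)) { seen[$1]; keys[++nkeys] = $1 }
    next
}

# Second pass: buffer one record at a time and flush it on every blank line.
NF >= 2 { vals[$1] = $2; have = 1 }
/^$/    { emit() }
END     { emit() }

function emit(   i, line) {
    if (!did_header) {                  # the header goes out once, before the first record
        for (i = 1; i <= nkeys; i++)
            line = (i == 1 ? keys[i] : line OFS keys[i])
        print line
        did_header = 1
    }
    if (have) {                         # print the buffered record, leaving blanks for missing keys
        line = ""
        for (i = 1; i <= nkeys; i++)
            line = (i == 1 ? vals[keys[i]] : line OFS vals[keys[i]])
        print line
    }
    split("", vals); have = 0           # reset the per-record buffer
}' file file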

dabest1
  • Question has been updated. Thanks for a good article. I would not want to reinvent the wheel if someone has faced a similar situation already. – dabest1 Feb 22 '13 at 02:40
  • @dabest1, do you know the maximum number of columns? – perreal Feb 22 '13 at 04:02
  • Number of columns would be low initially, but I would like a generic solution which at least supports up to 255 columns. – dabest1 Feb 22 '13 at 04:21
  • @dabest1: Correct me if I'm wrong, but I'm assuming each record will still have its own 'date and time' and 'file size' headers. Only the number of keys/values will change. Is that correct? – Steve Feb 22 '13 at 04:29
  • @dabest1: If you have all rows with only key1 and key2 until the end of file but the last entry also contains a key3, do you want to print out the initial records as `date|size|key1|key2||`? – perreal Feb 22 '13 at 04:32
  • @perreal: I would assume so. But I'm not the OP... – Steve Feb 22 '13 at 04:33
  • @perreal: Actually, (either a typo or oversight) I think there should only be one trailing delimiter in that case: `2013-02-21 18:18 PM|1283483 bytes|value1|value2|` Is that correct? – Steve Feb 22 '13 at 04:39
  • @Steve, yea I think so – perreal Feb 22 '13 at 04:40

5 Answers


Here's one way using GNU awk. Run like:

awk -f script.awk file

Contents of script.awk:

BEGIN {
    # change this to OFS="\t" for tab-delimited output
    OFS="|"

    # treat each record as a set of lines
    RS=""
    FS="\n"
}

{
    # keep a count of the records
    ++i

    # loop through each line in the record
    for (j=1;j<=NF;j++) {

        # split each line in two
        split($j,a,": ")

        # just holders for the first two lines in the record
        if (j==1) { date = a[1] }
        if (j==2) { size = a[1] }

        # keep a tally of the unique key names
        if (j>=3) { !x[a[1]] }

        # the data in a multidimensional array:
        # record number . key = value
        b[i][a[1]]=a[2]
    }
}

END {

    # sort the unique keys
    m = asorti(x,y)

    # add the two strings to a numerically indexed array
    c[1] = date
    c[2] = size

    # set a variable to continue from
    f=2

    # loop through the sorted array of unique keys
    for (j=1;j<=m;j++) {

        # build the header line from the file by adding the sorted keys
        r = (r ? r : date OFS size) OFS y[j]

        # continue to add the sorted keys to the numerically indexed array
        c[++f] = y[j]
    }

    # print the header and empty
    print r
    r = ""

    # loop through the records ('i' is the number of records)
    for (j=1;j<=i;j++) {

        # loop through the subrecords ('f' is the number of unique keys)
        for (k=1;k<=f;k++) {

            # build the output line
            r = (r ? r OFS : "") b[j][c[k]]
        }

        # and print and empty it ready for the next record
        print r
        r = ""
    }
}

Here's the contents of a test file, called file:

date and time: 2013-02-21 18:18 PM
file size: 1283483 bytes
key1: value1
key2: value2

date and time: 2013-02-21 18:19 PM
file size: 1283493 bytes
key2: value2
key1: value1
key3: value3

date and time: 2013-02-21 18:20 PM
file size: 1283494 bytes
key3: value3
key4: value4

date and time: 2013-02-21 18:21 PM
file size: 1283495 bytes
key5: value5
key6: value6

Results:

2013-02-21 18:18 PM|1283483 bytes|value1|value2||||
2013-02-21 18:19 PM|1283493 bytes|value1|value2|value3|||
2013-02-21 18:20 PM|1283494 bytes|||value3|value4||
2013-02-21 18:21 PM|1283495 bytes|||||value5|value6
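
Note: the b[i][a[1]] arrays-of-arrays syntax used above requires GNU awk 4.0 or later; the comments below report a syntax error on gawk 3.1.3 and success on 4.0.2. On older awks the usual workaround is a single array with a SUBSEP-joined key. A tiny self-contained illustration of that idiom (an adaptation hint only, not a full port of the script):

# gawk 4+  :  b[i][key] = value    (true array of arrays)
# older awk:  b[i, key] = value    (one array, key parts joined with SUBSEP)
BEGIN {
    b[1, "file size"] = "1283483 bytes"
    if ((1, "file size") in b)         # membership test uses a comma-separated key
        print b[1, "file size"]        # prints: 1283483 bytes
}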
Steve
  • Awesome, although I got a syntax error, first one: 'awk: script.awk:29: b[i][a[1]]=a[2]'. Maybe my version of awk is too old? Mine is GNU Awk 3.1.3. And I see in your sample results, you do not have any headers, probably would not be too hard to add them to your elegant solution. – dabest1 Feb 25 '13 at 18:28
  • Ok, I have tested this using Cygwin with GNU Awk 4.0.2, works like a charm. Kudos for detailed code comments, as they will help me and many others to learn awk's hidden power. I think I may still need to workout a solution that will work with old awk, as the servers will not be upgraded anytime soon. – dabest1 Feb 26 '13 at 01:23

Here's a pure awk solution:

# split lines on ": " and use "|" for output field separator
BEGIN { FS = ": "; i = 0; h = 0; ofs = "|" }

# empty line - increment item count and skip it
/^\s*$/ { i++ ; next } 

# normal line - add the item to the object and the header to the header list
# and keep track of first seen order of headers
{
   current[i, $1] = $2
   if (!($1 in headers)) {headers_ordered[h++] = $1}
   headers[$1]
}

END {
   h--

   # print headers
   for (k = 0; k <= h; k++)
   {
      printf "%s", headers_ordered[k]
      if (k != h) {printf "%s", ofs}
   } 
   print "" 

   # print the items for each object
   for (j = 0; j <= i; j++)
   {
      for (k = 0; k <= h; k++)
      {
         printf "%s", current[j, headers_ordered[k]]
         if (k != h) {printf "%s", ofs}
      }
      print ""
   }
}
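
Assuming the script is saved as, say, script.awk (the name is just an example), it can be run the same way as the GNU awk answer above:

awk -f script.awk file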

Example input (note that there should be a newline after the last item):

foo: bar
foo2: bar2
foo1: bar

foo: bar3
foo3: bar3
foo2: bar3

Example output:

foo|foo2|foo1|foo3
bar|bar2|bar|
bar3|bar3||bar3

Note: you will probably need to alter this if your data has ": " embedded in it.
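
If that ever becomes an issue, one possible adjustment (an untested sketch, not part of the original answer) is to split each line on the first ": " only, using index() and substr() instead of FS, and then use the resulting key/value where the script uses $1/$2. A minimal self-contained illustration:

# Split a line on the FIRST ": " only, so the value itself may contain ": ".
BEGIN {
    line = "note: contains: an embedded separator"
    p    = index(line, ": ")          # position of the first ": "
    key  = substr(line, 1, p - 1)     # -> "note"
    val  = substr(line, p + 2)        # -> "contains: an embedded separator"
    print key "|" val
}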

porges
  • Wow, that is short and concise. Unfortunately, when I tested it on sample data set provided in the question, it gave inappropriate results. – dabest1 Feb 25 '13 at 18:21
  • @dabest1: any more information? the two cases I know it will fail are mentioned in the post (if the last line doesn't end with '\n' and if there are embedded ": "). The second one can be worked around with a little more code... – porges Feb 25 '13 at 21:27
  • Yes, I don't have embedded ": " and I do have "\n" at the end of the last line. You can see the debug info here: http://pastebin.com/9JvQ9Wd4. – dabest1 Feb 26 '13 at 00:02
  • @dabest1: I forgot that `print` by itself is equivalent to `print $0` (not `print ""`) so it was printing out the last line read in weird places. The version in the answer now prints your test data correctly. – porges Feb 26 '13 at 05:54
  • Great, it works now. This seems to be the easiest solution to understand. I have couple more questions though. I noticed that there is an extra trailing separator on each line and would like to get rid of it, when I change it to some other character than tab, I can clearly see it and have to pipe the output through sed to remove it, e.g. `sed 's/|$//'`. Another question is how to control the order of headers, let's say I would like to display the headers in order of first appearance or in alphabetical order? – dabest1 Feb 26 '13 at 18:12
  • I have made those two feature improvements to your answer, will see if it gets approved as an edit. Big thanks! – dabest1 Feb 26 '13 at 22:31
  • I altered it to not print out the extra separator at the end of the line. I didn't add it initially as it just makes the code longer :) I don't see your edit anywhere to approve... – porges Feb 26 '13 at 22:47
  • It says: 'Your edit will be placed in a queue until it is peer reviewed.', I did not know that the owner cannot even see my proposed edits. Here is a preview: http://pastebin.com/mq7a1CFN. – dabest1 Feb 26 '13 at 23:31
  • I got the notification eventually but it had already been rejected. I've manually put your changes in now :) – porges Feb 26 '13 at 23:43
  • Thanks for all your help! I did learn a lot about associative arrays used in awk from this post. – dabest1 Feb 26 '13 at 23:44

This is a Perl solution. It does not make any assumptions about the column structure, so it does not try to order the columns; however, all fields are printed in the same order for all records:

use strict;
use warnings;

my (@db, %f, %fields);
my $counter = 1;
while (<>) {
  my ($field, $value) = (/([^:]*):\s*(.*)\s*$/);
  if (not defined $field) {
    push @db, { %f };
    %f = (); 
  } else {
    $f{$field} = $value;
    $fields{$field} = $counter++ if not defined $fields{$field};
  }
}
push @db, \%f;

#my @fields = sort keys %fields; # alphabetical order
my @fields = sort { $fields{$a} <=> $fields{$b} } keys %fields; # first-seen order (numeric sort on the counter)

# print header
print join("|", @fields), "\n";

# print rows
for my $row (@db) {
  print join("|", map { $row->{$_} ? $row->{$_} : "" } @fields), "\n";
}
perreal
  • Thanks! You should mention that this is a solution in perl for the rest of the folks, who may not know it. I got some syntax errors, the first one is 'Use of uninitialized value $file in open at script.pl line 20.'. I'm assuming it is due to my version of perl: v5.14.2. – dabest1 Feb 25 '13 at 18:33
  • run it with a file name argument like this: `perl script.pl input_file` – perreal Feb 25 '13 at 21:44
  • Yes, that's better. First time I ran it like this: `perl script.pl < filename.txt`. Looks like I may need to tweak it a little bit as it outputs extra delimiters at the end and adds a space to a separator, but overall it achieves the main goal! – dabest1 Feb 26 '13 at 01:31
  • @dabest1, I did some tweaking – perreal Feb 26 '13 at 05:45
  • Thanks, this is great. One more question, how do I control the output of columns, so that they display in first seen order or in alphabetical order? – dabest1 Feb 26 '13 at 18:52
  • Updated, now it shows in first seen order. Uncomment the line that says alphabetical order to change the behavior. – perreal Feb 26 '13 at 23:59
  • Awesome and thank you, I wish I could mark more than one answer as an accepted answer. I am pretty sure that this will come in handy for projects in perl only. – dabest1 Feb 27 '13 at 00:13

An example using column -t to align whitespace-separated output into a table:

> ls -aFd * | xargs -L 5 echo | column -t
bras.tcl@      Bras.tpkg/           CctCc.tcl@       Cct.cfg      consider.tcl@
cvsknown.tcl@  docs/                evalCmds.tcl@    export/      exported.tcl@
IBras.tcl@     lastMinuteRule.tcl@  main.tcl@        Makefile     Makefile.am
Makefile.in    makeRule.tcl@        predicates.tcl@  project.cct  sourceDeps.tcl@
tclIndex   
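
Applied to this question, the same trick could pretty-print the "|"-separated output of the awk answers above as an aligned table (a usage sketch; script.awk stands for any of the scripts above that emit "|"-separated rows):

awk -f script.awk file | column -t -s'|'

Note that many column implementations merge adjacent delimiters, so rows with empty cells may not line up as expected.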

                                                
Andreas Otto

Using Perl:

use strict; use warnings;

# read the file paragraph by paragraph
$/ = "\n\n";

print "date and time|file size|key1|key2\n";

# parsing the whole file with the magic diamond operator
while (<>) {
    if (/^date and time:\s+(.*)/m) {
        print "$1|";
    }

    if (/^file size:(.*)/m) {
        print "$1|";
    }

    if (/^key1:(.*)/m) {
        print "$1|";
    }
    else {
        print "|";
    }

    if (/^key2:(.*)/m) {
        print "$1\n";
    }
    else {
        print "\n";
    }
}

Usage

perl script.pl file

Output

date and time|file size|key1|key2
2013-02-21 18:18 PM| 1283483 bytes| value| value
2013-02-21 18:19 PM| 1283493 bytes|| value
Gilles Quénot
  • Thanks, I hoped someone already had the solution built, but looks like you had to do it from scratch. By the way I got it worked out myself as well, but with a few quirks. I added the solution as an answer. – dabest1 Feb 22 '13 at 04:14
  • Oh, you are hard coding the field names. I am looking for something that can accept arbitrary column names. I need it to be generic, so it can be applied to any list of key/value pairs. – dabest1 Feb 22 '13 at 04:20