
I have a file (data.rdb) with the following format:

date    star    jdb texp
2013-11-22  epsInd      2400000.23551544    100.
2013-11-22  epsInd      2400000.23551544    100.
2013-11-22  epsInd      2400000.23551544    100.
2013-11-22  HD217987    2400000.23551544    900.
2013-11-22  TOI-134     2400000.23551544    900.
2013-11-22  tauCet      2400000.23551544    60. 
2013-11-22  BD+01316    2400000.23551544    300.
2013-11-22  BD+01316    2400000.23551544    300.
2013-11-22  BD+01316    2400000.23551544    300.
2013-11-22  BD+01316    2400000.23551544    300.

some properties:

  • all columns are tab separated
  • the columns do not have the same width
  • the cells might not have the same length
  • the file will have many more columns than shown here, and a few hundred lines
  • the column names can be any word, with no tabs, spaces, or special characters

How can I move the column with header jdb to be the first column?

Some constraints:

  • this will be applied to multiple files, and the column jdb will not always appear at the same position
  • ideally the order of the remaining columns should not change
  • jdb will always be the 1st column in the end.

Thanks!

UPDATE

this is the awk block I am using at the moment:

BEGIN {
    numCols = split(column_list,cols)
    OFS="\t"
}
{ sub(/\r$/,"") }
NR==1 {
    for (fldNr=1; fldNr<=NF; fldNr++) {
        f[$fldNr] = fldNr
    }
}
{
    for (colNr=1; colNr<=numCols; colNr++) {
        colName = cols[colNr]
    colVal  = (colNr=1 ? $(f["jdb"]): (colNr <= $(f["jdb"] ? $(f[colName] -1) : $(f[colName]))))
        printf "%s%s", colVal, (colNr<numCols ? OFS : ORS)
    }
}

but it gives me no output... What I (think I) did:

  1. assign each column header value a number

  2. iterate over a range

    2.1 if iterator = 1 -> print column jdb

    2.2 if iterator <= column number of jdb -> print column number iterator - 1

    2.3 if iterator > column number of jdb -> print column number iterator
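
Applying Ed Morton's comment below (`=` is assignment, `==` is comparison) and rebalancing the ternary's parentheses gives a working sketch of the steps above. This is only an illustration, not the accepted solution: it uses `NF` instead of the external `column_list`, sets `FS` explicitly to tab, and inlines two sample rows instead of reading `data.rdb`:

```shell
# sketch of the corrected block: `colNr=1` (assignment) becomes
# `colNr==1` (comparison), and the ternary's parentheses are rebalanced;
# NF replaces the external column_list, and sample rows are inlined
printf 'date\tstar\tjdb\ttexp\n2013-11-22\tepsInd\t2400000.23551544\t100.\n' |
awk 'BEGIN { FS = OFS = "\t" }
{ sub(/\r$/, "") }
NR == 1 { for (fldNr = 1; fldNr <= NF; fldNr++) f[$fldNr] = fldNr }
{
    jdbPos = f["jdb"]
    for (colNr = 1; colNr <= NF; colNr++) {
        # step 2.1: position 1 gets jdb; step 2.2: positions up to
        # jdb'\''s old slot shift right by one; step 2.3: rest unchanged
        colVal = (colNr == 1 ? $jdbPos : (colNr <= jdbPos ? $(colNr-1) : $colNr))
        printf "%s%s", colVal, (colNr < NF ? OFS : ORS)
    }
}'
# prints: jdb  date  star  texp  (then the reordered data row)
```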

(this is on the continuation of the question I posed in https://stackoverflow.com/questions/56132249/extract-columns-from-tab-separated-file)


END RESULT

In the end I ended up using @Ed Morton's solution:

$ cat move_to_first.awk
BEGIN { FS=OFS="\t" }
NR==1 {
    cols[++numCols] = tgt
    for (fldNr=1; fldNr<=NF; fldNr++) {
        f[$fldNr] = fldNr
        if ($fldNr != tgt) {
            cols[++numCols] = $fldNr
        }
    }
}
{
    for (colNr=1; colNr<=numCols; colNr++) {
        colName = cols[colNr]
        printf "%s%s", $(f[colName]), (colNr<numCols ? OFS : ORS)
    }
}

As a curiosity, to move the column to the last position, the above code just needs the following modification:

$ cat move_to_last.awk
BEGIN { FS=OFS="\t" }
NR==1 {
    for (fldNr=1; fldNr<=NF; fldNr++) {
        f[$fldNr] = fldNr
        if ($fldNr != target) {
            cols[++numCols] = $fldNr
        }
    }
    cols[++numCols] = target
}
{
    for (colNr=1; colNr<=numCols; colNr++) {
        colName = cols[colNr]
        printf "%s%s", $(f[colName]), (colNr<numCols ? OFS : ORS)
    }
}
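
As a usage sketch (the script body is copied from above; the filename and inlined sample row are just for illustration, instead of reading data.rdb), the move-to-last variant is driven the same way, passing the column name via `-v target=...`:

```shell
# write the move-to-last script from above to a file, then run it
cat > move_to_last.awk <<'AWK'
BEGIN { FS=OFS="\t" }
NR==1 {
    for (fldNr=1; fldNr<=NF; fldNr++) {
        f[$fldNr] = fldNr
        if ($fldNr != target) cols[++numCols] = $fldNr
    }
    cols[++numCols] = target
}
{
    for (colNr=1; colNr<=numCols; colNr++)
        printf "%s%s", $(f[cols[colNr]]), (colNr<numCols ? OFS : ORS)
}
AWK
# sample row inlined for illustration instead of data.rdb
printf 'date\tstar\tjdb\ttexp\n2013-11-22\tepsInd\t2400000.23551544\t100.\n' |
awk -v target=jdb -f move_to_last.awk
# prints: date  star  texp  jdb  (jdb moved to the last column)
```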
jorgehumberto
    Maybe look into installing dedicated utilities to handle csv-like files, rather then reinventing the wheel with bash and awk, like [csvkit](https://github.com/wireservice/csvkit) – KamilCuk May 17 '19 at 13:00
  • @EdMorton: My first attempt was by using the method suggested in your previous answer twice, one to select the `jdb` column and a second one selecting all the other columns. Then I would concatenate. But that is ugly, so I am still looking at a more elegant solution. What I am trying to do is to select the column `jdb`, store its index in a variable and then re-order while doing the print... – jorgehumberto May 17 '19 at 14:05
  • `=` is assignment, `==` is comparison. See `colNr=1` as the condition in the ternary in your script. – Ed Morton May 17 '19 at 16:02

4 Answers


This is a little verbose, but it does the job:

awk 'NR==1{for(i=1;i<=NF;i++){if ($i=="jdb") break;}} {printf "%s\t",$i; for (j=1;j<=NF;j++){if (i!=j){printf j==NF||(j==NF-1&&j+1==i)?"%s\n":"%s\t", $j}}}' yourfile.txt

Per Ed Morton's excellent suggestion, here is the script with proper whitespace, indentation, and line breaks:

    NR == 1 {
            for (i = 1; i <= NF; i++) {
                    if ($i == "jdb") {
                            break
                    }
            }
    }

    {
            printf "%s\t", $i
            for (j = 1; j <= NF; j++) {
                    if (i != j) {
                            printf (j == NF || j == NF - 1 && j + 1 == i ? "%s\n" : "%s\t"), $j
                    }
            }
    }

You can stick that into its own file (say... script.awk) and then call it: awk -f script.awk yourfile.txt

JNevill
    Thanks @EdMorton I didn't know that `gawk -o-` was even a thing. I've always formatted the one-liners that get out of control by hand. You are the man. – JNevill May 17 '19 at 14:26

Well, I was really hoping for a "teach a man to fish" moment here but you're getting answers anyway so... here's how to tweak the previous answer to do what you now want:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==1 {
    cols[++numCols] = tgt
    for (fldNr=1; fldNr<=NF; fldNr++) {
        f[$fldNr] = fldNr
        if ($fldNr != tgt) {
            cols[++numCols] = $fldNr
        }
    }
}
{
    for (colNr=1; colNr<=numCols; colNr++) {
        colName = cols[colNr]
        printf "%s%s", $(f[colName]), (colNr<numCols ? OFS : ORS)
    }
}

$ awk -v tgt=jdb -f tst.awk data.rdb
jdb     date    star    texp
2400000.23551544        2013-11-22      epsInd  100.
2400000.23551544        2013-11-22      epsInd  100.
2400000.23551544        2013-11-22      epsInd  100.
2400000.23551544        2013-11-22      HD217987        900.
2400000.23551544        2013-11-22      TOI-134 900.
2400000.23551544        2013-11-22      tauCet  60.
2400000.23551544        2013-11-22      BD+01316        300.
2400000.23551544        2013-11-22      BD+01316        300.
2400000.23551544        2013-11-22      BD+01316        300.
2400000.23551544        2013-11-22      BD+01316        300.

Note how simple the loop is that gets executed once per input line where you want the efficiency to be because all the hard work of determining the output order is done in the NR==1 block that just gets executed once for the whole file.

In this particular case where you don't actually care about the other column names you can write it more concisely and efficiently as:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==1 {
    numOutFlds = 1
    for (inFldNr=1; inFldNr<=NF; inFldNr++) {
        out2inFldNrs[$inFldNr == tgt ? 1 : ++numOutFlds] = inFldNr
    }
}
{
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        inFldNr = out2inFldNrs[outFldNr]
        printf "%s%s", $inFldNr, (outFldNr<numOutFlds ? OFS : ORS)
    }
}

$ awk -v tgt=jdb -f tst.awk data.rdb
jdb     date    star    texp
2400000.23551544        2013-11-22      epsInd  100.
2400000.23551544        2013-11-22      epsInd  100.
2400000.23551544        2013-11-22      epsInd  100.
2400000.23551544        2013-11-22      HD217987        900.
2400000.23551544        2013-11-22      TOI-134 900.
2400000.23551544        2013-11-22      tauCet  60.
2400000.23551544        2013-11-22      BD+01316        300.
2400000.23551544        2013-11-22      BD+01316        300.
2400000.23551544        2013-11-22      BD+01316        300.
2400000.23551544        2013-11-22      BD+01316        300.
Ed Morton
    Sorry, had a presentation, so could not answer before, but I definitely like the `teach a man how to fish` approach. I've posted my attempt at this now. Thank you for the answer, I will try to get my head around this in the evening :) – jorgehumberto May 17 '19 at 15:49
    Works perfectly, now time to understand it. :) thanks! – jorgehumberto May 17 '19 at 16:01
  • This solution works well when processing a single file at time. If you want to run multiple files with single awk call. Replace NR with FNR. Then the record counter will restart at each file. – Dudi Boy May 17 '19 at 19:36
    @DudiBoy it's counting fields, not records, so if the files all have the same number of fields then the only thing you'd need to do is add `-i inplace` (GNU awk) to make it work while if they don't all have the same number of fields then you'd need to delete the arrays in an FNR==1 section, not just change NR to FNR. Multiple files doesn't seem to be what the OP is dealing with though. – Ed Morton May 17 '19 at 20:35
  • OK, I've looked into this and still can't understand what `cols[++numCols] = tgt` and `cols[++numCols] = $fldNr` mean. Where do you define `numCols`? – jorgehumberto Jun 04 '19 at 15:26
  • You don't need to declare variables in awk, they get declared and assigned a value the first time you use them based on the context in which you use them. So `cols[++numCols] = tgt` declares an array named `cols[]` initially empty and a numeric variable named `numCols` initially of type numeric-string and value zero-or-null which is then incremented to become value `1` and of type `numeric` since we just performed a numeric operation on it. `cols["1"]` is then used to store whatever value the variable `tgt` contains. – Ed Morton Jun 04 '19 at 15:49
  • ok, now I understood, so if I want `target` to be the last column, I just need to move `cols[++numCols] = tgt` to after the `for (fldNr=1; fldNr<=NF; fldNr++)` cycle, correct? – jorgehumberto Jun 04 '19 at 15:58
  • I'm sorry, it's been long enough that I don't remember this question and I don't feel like going back and re-reading it, etc. Please ask a new question if you have one. – Ed Morton Jun 04 '19 at 16:00
    sorry, I do not have another question for the moment, just checking I understood what you meant. Thanks for all the help! :) – jorgehumberto Jun 04 '19 at 16:03
  • Sorry, I'd have to re-read your question, etc. to know how to answer you and I'm too lazy - I wish you'd asked a couple of weeks ago when I might've remembered what this was about! – Ed Morton Jun 04 '19 at 16:09

So the task is twofold:

  • first determine which column is the column we want to be the first
  • then change the order of the columns

So:

# our testing input file
cat <<EOF >file
date    star    jdb texp
2013-11-22  epsInd      2400000.23551544    100.
2013-11-22  epsInd      2400000.23551544    100.
2013-11-22  epsInd      2400000.23551544    100.
2013-11-22  HD217987    2400000.23551544    900.
2013-11-22  TOI-134     2400000.23551544    900.
2013-11-22  tauCet      2400000.23551544    60. 
2013-11-22  BD+01316    2400000.23551544    300.
2013-11-22  BD+01316    2400000.23551544    300.
2013-11-22  BD+01316    2400000.23551544    300.
2013-11-22  BD+01316    2400000.23551544    300.
EOF

# my copy+paste messed up tabs with spaces, fix it
sed 's/[[:space:]]\+/\t/g' -i file


# first we need header count.
# I could remove all characters except tabs and use wc -c
# but was lazy, this will not affect performance anyway
hdrcnt=$(
    head -n1 file |
    tr '\t' '\n' |
    wc -l
)

# get the column number that has jdb
# I get the first line
# substitute tab with newlines
# and get the line number with "jdb"
num=$(
    head -n1 file |
    tr '\t' '\n' |
    grep -n jdb | 
    cut -d: -f1
)

# then I generate the awk script
# so it's like '{print $num, $1, $2 ... except $num ... $hdrcnt }'
awkarg='{print $'"$num"', '"$(
    seq $hdrcnt |
    grep -v "$num" |
    sed 's/\(.*\)/$\1, /' |
    sed '$s/, //' |
    tr -d '\n'
)"'}'

# finally run awk
awk -v FS='\t' -v OFS='\t' "$awkarg" file
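
To see what actually gets generated, here is the `awkarg` construction rerun in isolation, with `num` and `hdrcnt` hard-coded to the values the pipeline above would find for the sample header (jdb in column 3 of 4):

```shell
# rebuild awkarg for the sample header: num=3 (jdb), hdrcnt=4 columns
num=3
hdrcnt=4
awkarg='{print $'"$num"', '"$(
    seq $hdrcnt |
    grep -v "$num" |
    sed 's/\(.*\)/$\1, /' |
    sed '$s/, //' |
    tr -d '\n'
)"'}'
echo "$awkarg"    # {print $3, $1, $2, $4}
```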
KamilCuk

In Perl, you can make use of the Text::CSV_XS library:

#! /usr/bin/perl
use warnings;
use strict;

use Text::CSV_XS;

open my $fh, '<', shift or die $!;

my $csv = 'Text::CSV_XS'->new({sep_char => "\t"});

my $row = $csv->getline($fh);

my ($jdb) = grep $row->[$_] eq 'jdb', 0 .. $#$row;

do {
    unshift @$row, splice @$row, $jdb, 1;
    $csv->say(*STDOUT, $row);
} while $row = $csv->getline($fh);
choroba