
I have a file (data.rdb) with the following format:

col1    col2    col3    col4    col5    col6    col7
aaa1    bbb1    ccc1    ddd1    eee1    fff1    ggg1
aaa2    bbb2    ccc2    ddd2    eee2    fff2    ggg2
aaa3    bbb3    ccc3    ddd3    eee3    fff3    ggg3

Some properties of the data:

  • all columns are tab separated
  • the columns do not have the same width
  • the cells might not have the same length
  • the file will have many more columns than shown here and a few hundred lines
  • the column names I provided are just generic; the real names can be any word, with no tabs, spaces or special characters.

I need to extract some of the columns by name using bash, e.g., col1, col3 and col6, where the columns to select come from a shell variable defined as COLUMN_LIST=$@, where $@ are the parameters passed to my shell script. The number and names of the parameters might change each time I call the script.

The script needs to be in bash, cannot be python or similar.

Any ideas? I thought about using awk/gawk, but I do not know how to select by column name. The column order might change from file to file.

Thanks Jorge

UPDATE

For some reason, none of these solutions seem to work on my real data files (i.e., I get no output at all), so I am posting a subset of one of those:

date    star    jdb texp
2013-11-22  epsInd   2400000.23551544   100.
2013-11-22  epsInd   2400000.23551544   100.
2013-11-22  epsInd   2400000.23551544   100.
2013-11-22  HD217987 2400000.23551544   900.
2013-11-22  TOI-134  2400000.23551544   900.
2013-11-22  tauCet   2400000.23551544   60. 
2013-11-22  BD+01316 2400000.23551544   300.
2013-11-22  BD+01316 2400000.23551544   300.
2013-11-22  BD+01316 2400000.23551544   300.
2013-11-22  BD+01316 2400000.23551544   300.

in this case, I would be interested in columns star jdb and texp

UPDATE 2

I've used @EdMorton's code and this is the result:

date    star    jdb texp    date    star    jdb texp
2013-11-22  epsInd   2400000.23551544   100.    2013-11-22  epsInd   2400000.23551544   100.
2013-11-22  epsInd   2400000.23551544   100.    2013-11-22  epsInd   2400000.23551544   100.
2013-11-22  epsInd   2400000.23551544   100.    2013-11-22  epsInd   2400000.23551544   100.
2013-11-22  HD217987 2400000.23551544   900.    2013-11-22  HD217987 2400000.23551544   900.
2013-11-22  TOI-134  2400000.23551544   900.    2013-11-22  TOI-134  2400000.23551544   900.
2013-11-22  tauCet   2400000.23551544   60.     2013-11-22  tauCet   2400000.23551544   60. 
2013-11-22  BD+01316 2400000.23551544   300.    2013-11-22  BD+01316 2400000.23551544   300.
2013-11-22  BD+01316 2400000.23551544   300.    2013-11-22  BD+01316 2400000.23551544   300.
2013-11-22  BD+01316 2400000.23551544   300.    2013-11-22  BD+01316 2400000.23551544   300.
2013-11-22  BD+01316 2400000.23551544   300.    2013-11-22  BD+01316 2400000.23551544   300.

UPDATE 3

I ended up using EdMorton's version of the awk script -- mainly for the flexibility of the output -- but modified so that it does not output wrong columns:

BEGIN {
    numCols = split(column_list,cols)
    OFS="\t"
}
{ sub(/\r$/,"") }
NR==1 {
    for (fldNr=1; fldNr<=NF; fldNr++) {
        f[$fldNr] = fldNr
    }
}
{
    for (colNr=1; colNr<=numCols; colNr++) {
        colName = cols[colNr]
        colVal  = (colName in f ? $(f[colName]) : "")
        printf "%s%s", colVal, (colNr<numCols ? OFS : ORS)
    }
}
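The modified script above can be driven from the shell parameters as the question intended. A minimal sketch, where the file name `extract_cols.sh` and the function name `extract_cols` are hypothetical; it relies on `"$*"` joining the arguments with spaces, which `split()` undoes inside awk:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (extract_cols.sh): file name first, column names after, e.g.
#   ./extract_cols.sh data.rdb star jdb texp
extract_cols() {
    local file=$1; shift
    # "$*" joins the remaining arguments with single spaces, which
    # split(column_list, cols) inside awk breaks apart again.
    awk -v column_list="$*" '
    BEGIN { numCols = split(column_list, cols); OFS = "\t" }
    { sub(/\r$/, "") }
    NR == 1 { for (fldNr = 1; fldNr <= NF; fldNr++) f[$fldNr] = fldNr }
    {
        for (colNr = 1; colNr <= numCols; colNr++) {
            colName = cols[colNr]
            colVal  = (colName in f ? $(f[colName]) : "")
            printf "%s%s", colVal, (colNr < numCols ? OFS : ORS)
        }
    }' "$file"
}
if [ "$#" -gt 0 ]; then extract_cols "$@"; fi
```

Joining with `"$*"` is safe here only because the question guarantees column names contain no spaces or tabs.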

The main issue I had was that the header line was not tab separated, so the column breakdown did not work. An easy way to spot tab/non-tab characters:

tr $'\t' '#' < data.rdb | head -2

which gave on one of my test files:

date    star    jdb texp
2013-11-22#epsInd#2400000.23551544#100.
  • How do you call your script? Like `./myscript 1 3 6` or `./myscript colname1 colname2 colname3` – kvantour May 14 '19 at 14:16
  • @kvantour: `./myscript colname1 colname2 ` – jorgehumberto May 14 '19 at 14:21
  • sorry, by anything I mean any word, I've updated the question. – jorgehumberto May 14 '19 at 17:14
  • They are indeed tab separated, and can't seem to find any issue with them. – jorgehumberto May 14 '19 at 17:36
  • The only difference I can find is that I have much more lines and more columns – jorgehumberto May 14 '19 at 17:38
  • I strongly suspect your input file has DOS line endings, see https://stackoverflow.com/q/45772525/1745001 for what that means and how to fix it. Meanwhile, I updated my answer to protect against DOS line endings and/or a user asking for a column name that doesn't exist to be output. Please try it now and let us know the result. – Ed Morton May 14 '19 at 17:57
  • Ok, I've checked a couple of the files with `cat -vE` and the only strange thing I get are dollar signs at the end of every line... Running your updated version of the code gets me the correct column(s) headers, however all filled with N/A. – jorgehumberto May 14 '19 at 20:29
  • `$` signs indicate the line feed character. They are normal/expected at the end of every line. From your other comment it sounds like your columns actually are NOT tab separated as you claimed and that is your problem. For next time - you could have spotted that by running `tr $'\t' '#' < file` to make tabs visible. – Ed Morton May 15 '19 at 14:55
  • that is a good command to keep, thanks! When I run it I get `#` in place of the tabs, but spaces in the headers, although I create my headers with `header=$(echo date$'\t'star$'\t'rv$'\t'dvrms$'\t')`. Why is that? – jorgehumberto May 15 '19 at 15:00
  • You're welcome. You didn't quote the string you passed to echo so echo saw each tab-separated value as a separate argument and output them separated by blanks. Try just `header=$'date\tstar\trv\tdvrms'` instead. – Ed Morton May 15 '19 at 15:28
  • ok, thanks, will correct my code taking this into account, thanks! – jorgehumberto May 15 '19 at 15:33

3 Answers


The column order might change from file to file.

You may use this awk approach: it takes space-separated header column names as input and converts them into column numbers by processing the first record. Once the required column numbers are known, those columns are printed for every line.

awk -v cols='col1 col3 col6' 'BEGIN {
   FS=OFS="\t"
   n = split(cols, a, " ")
   for (i=1; i <= n; i++)
      c[a[i]]
}
{
   sub(/\r$/, "")
}
NR == 1 {
   for (i=1; i<=NF; i++)
      if ($i in c)
         hdr[i]
}
{
   for (i=1; i<=NF; i++)
      if (i in hdr)
         s = sprintf(s "%s%s", OFS, $i)
   sub(OFS, "", s)
   print s
   s =""
}' file | column -t

star      jdb               texp
epsInd    2400000.23551544  100.
epsInd    2400000.23551544  100.
epsInd    2400000.23551544  100.
HD217987  2400000.23551544  900.
TOI-134   2400000.23551544  900.
tauCet    2400000.23551544  60.
BD+01316  2400000.23551544  300.
BD+01316  2400000.23551544  300.
BD+01316  2400000.23551544  300.
BD+01316  2400000.23551544  300.

PS: I added column -t to format the output as a table.

anubhava
  • Thanks, but I can't seem to get it to work on my files. One note, the real columns are not named "colX" but have totally different names, such as `rv`, `fwhm`, etc. Would that make a difference? – jorgehumberto May 14 '19 at 16:53
  • @jorgehumberto: Check updated answer with your newly updated input – anubhava May 14 '19 at 17:58

The best way to deal with this is to create an array (f[] below) that maps the column header strings (i.e. the field names) to the field numbers when reading the header line, and then access the fields by their names from then on.

Updated to protect against the caller asking for a column name that doesn't exist and against DOS line endings:

$ cat tst.awk
BEGIN {
    numCols = split(column_list,cols)
    FS=OFS="\t"
}
{ sub(/\r$/,"") }
NR==1 {
    for (fldNr=1; fldNr<=NF; fldNr++) {
        f[$fldNr] = fldNr
    }
}
{
    for (colNr=1; colNr<=numCols; colNr++) {
        colName = cols[colNr]
        colVal  = (colName in f ? $(f[colName]) : (NR>1 ? "N/A" : colName))
        printf "%s%s", colVal, (colNr<numCols ? OFS : ORS)
    }
}

$ awk -v column_list='col1 col3 col6' -f tst.awk data.rdb
col1    col3    col6
aaa1    ccc1    fff1
aaa2    ccc2    fff2
aaa3    ccc3    fff3

$ awk -v column_list='col1 col3 col6 bob' -f tst.awk data.rdb
col1    col3    col6    bob
aaa1    ccc1    fff1    N/A
aaa2    ccc2    fff2    N/A
aaa3    ccc3    fff3    N/A

Note that with the above approach you can also change the order of the columns on output, not just print them in their original order:

$ awk -v column_list='col5 col2 col4' -f tst.awk data.rdb
col5    col2    col4
eee1    bbb1    ddd1
eee2    bbb2    ddd2
eee3    bbb3    ddd3
Ed Morton
  • ok, managed to get the above working, just swapped `FS=OFS="\t"` for `OFS="\t"`. – jorgehumberto May 15 '19 at 14:44
  • You said in your question `all columns are tab separated`. If you had to not set FS to tab for this to work then that statement isn't true which would certainly explain why you couldn't get any of the solutions we provided based on that statement to work for you. – Ed Morton May 15 '19 at 14:46
  • That is the strange thing, I've double checked my data files and the separators appear to be `tabs` (or at least show as `tab` when I select them on the command line..). – jorgehumberto May 15 '19 at 14:55
  • And I've generated them as tabs. When I create the files with `awk` (code adapted from https://stackoverflow.com/questions/25168259/how-do-you-check-if-a-text-file-had-tab-as-its-delimiter-in-bash) I made sure the OFS="\t" is set. – jorgehumberto May 15 '19 at 14:57
  • Run `tr $'\t' '#' < data.rdb | head -2` and post the output. – Ed Morton May 15 '19 at 14:58
  • ok, problem appears to be with the headers, sorry about that. line1: `date star jdb texp` line2: `2013-11-22#epsInd#2400000.23551544#100.` – jorgehumberto May 15 '19 at 15:06

You could do it with coreutils. Assuming you have a file called cols containing the desired column names, e.g.:

col2
col3
col6

You can extract column numbers like this:

head -n1 infile | tr '\t' '\n' | grep -nf cols | cut -d: -f1 | paste -sd,

Output:

2,3,6

Pass this to cut, e.g.:

cut -f $(head -n1 infile | tr '\t' '\n' | grep -nf cols | cut -d: -f1 | paste -sd,) infile

Output:

col2    col3    col6
bbb1    ccc1    fff1
bbb2    ccc2    fff2
bbb3    ccc3    fff3
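If the column names come from the script's arguments rather than a cols file, the same pipeline can read them via process substitution (bash-specific). A minimal sketch; `pick.sh` and `pick_cols` are hypothetical names, and `-x -F` are added so the names match whole header fields literally:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (pick.sh): ./pick.sh infile col2 col3 col6
pick_cols() {
    local infile=$1; shift
    local nums
    # -x -F: match whole header names literally, so "col1" will not also hit "col10"
    nums=$(head -n1 "$infile" | tr '\t' '\n' |
           grep -nxFf <(printf '%s\n' "$@") | cut -d: -f1 | paste -sd,)
    cut -f "$nums" "$infile"
}
if [ "$#" -gt 0 ]; then pick_cols "$@"; fi
```

Note that cut always emits the columns in file order, regardless of the order the names are passed in.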
Thor