I have a file (data.rdb
) with the following format:
col1 col2 col3 col4 col5 col6 col7
aaa1 bbb1 ccc1 ddd1 eee1 fff1 ggg1
aaa2 bbb2 ccc2 ddd2 eee2 fff2 ggg2
aaa3 bbb3 ccc3 ddd3 eee3 fff3 ggg3
Some properties of the data:
- all columns are tab separated
- the columns do not have the same width
- the cells might not have the same length
- the file will have much more columns than presented and a few hundreds of lines
- the columns names I provided are just generic, the real names can be any word, with no tabs or spaces or special characters.
I need to extract some of the columns by name using bash, e.g., col1
, col3
and col6
, where the columns to select come from a shell variable defined as COLUMN_LIST=$@
where $@
are the parameters passed to my shell script. The number and name of the parameters might change each time I call the script.
The script needs to be in bash, cannot be python or similar.
Any ideas? I thought about using awk
/gawk
, but I do not know how to select by column name. The column order might change from file to file.
Thanks Jorge
UPDATE
for some reason, none of these solutions seem to work on my real data files (i.e., I get no output at all), so I am posting a subset of one of those:
date star jdb texp
2013-11-22 epsInd 2400000.23551544 100.
2013-11-22 epsInd 2400000.23551544 100.
2013-11-22 epsInd 2400000.23551544 100.
2013-11-22 HD217987 2400000.23551544 900.
2013-11-22 TOI-134 2400000.23551544 900.
2013-11-22 tauCet 2400000.23551544 60.
2013-11-22 BD+01316 2400000.23551544 300.
2013-11-22 BD+01316 2400000.23551544 300.
2013-11-22 BD+01316 2400000.23551544 300.
2013-11-22 BD+01316 2400000.23551544 300.
in this case, I would be interested in columns star
jdb
and texp
UPDATE 2
I've used @EdMorton's code and this is the result:
date star jdb texp date star jdb texp
2013-11-22 epsInd 2400000.23551544 100. 2013-11-22 epsInd 2400000.23551544 100.
2013-11-22 epsInd 2400000.23551544 100. 2013-11-22 epsInd 2400000.23551544 100.
2013-11-22 epsInd 2400000.23551544 100. 2013-11-22 epsInd 2400000.23551544 100.
2013-11-22 HD217987 2400000.23551544 900. 2013-11-22 HD217987 2400000.23551544 900.
2013-11-22 TOI-134 2400000.23551544 900. 2013-11-22 TOI-134 2400000.23551544 900.
2013-11-22 tauCet 2400000.23551544 60. 2013-11-22 tauCet 2400000.23551544 60.
2013-11-22 BD+01316 2400000.23551544 300. 2013-11-22 BD+01316 2400000.23551544 300.
2013-11-22 BD+01316 2400000.23551544 300. 2013-11-22 BD+01316 2400000.23551544 300.
2013-11-22 BD+01316 2400000.23551544 300. 2013-11-22 BD+01316 2400000.23551544 300.
2013-11-22 BD+01316 2400000.23551544 300. 2013-11-22 BD+01316 2400000.23551544 300.
UPDATE 3
I ended up using EdMorton's version of awk
-- mainly for flexibility on the output -- but with the modification that I do not want it to output wrong columns:
BEGIN {
numCols = split(column_list,cols)
OFS="\t"
}
{ sub(/\r$/,"") }
NR==1 {
for (fldNr=1; fldNr<=NF; fldNr++) {
f[$fldNr] = fldNr
}
}
{
for (colNr=1; colNr<=numCols; colNr++) {
colName = cols[colNr]
colVal = (colName in f ? $(f[colName]) : "")
printf "%s%s", colVal, (colNr<numCols ? OFS : ORS)
}
}
The main issue I got was that the header line was not tab separated and as such column breakdown did not work. An easy way to spot tab/non-tab characters:
tr $'\t' '#' < data.rdb | head -2
which gave on one of my test files:
date star jdb texp
2013-11-22#epsInd#2400000.23551544#100.