6

I have the following problem that I want to solve in awk. I have one large text table, comma separated, consisting of 100k rows and 5k cols. The first row is a header and the first column is a record id. I then have a second text file that contains a subset of the headers in the first file. I want to extract all the columns of the first file whose header is contained in the list given in the second file. Here an example of the inputs and the desired output:

DATA.TXT

   ID, head1, head2, head3, head4  
    1, 25.5, 1364.0, 22.5, 13.2  
    2, 10.1, 215.56, 1.15, 22.2  

LIST.TXT

head1  
head4  

Desired output:

ID, head1, head4  
1, 25.5, 13.2  
2, 10.1, 22.2

Anybody can give me some advice on how to solve this problem in awk or however through unix scripting? Thanks in advance for any help!

codeforester
  • 39,467
  • 16
  • 112
  • 140
lud
  • 63
  • 1
  • 1
  • 4
  • Did you want ID in the LIST.TXT file? (It will simplify solution if you can include it). Good luck. (Edit your question if that is true). – shellter Jun 19 '12 at 10:43
  • @shellter in my case the solution is simpler the way it is now :) – Lev Levitsky Jun 19 '12 at 10:45
  • no, the LIST.TXT does not have any ID field and is give as is. The ID in the output is the one in DATA.TXT. @Lev your solution is indeed (apparently) simple. Once I understand why each one symbol is where it is! :) – lud Jun 19 '12 at 11:11

2 Answers2

4

There is a useful awk script here which you can use to extract specific column names from a csv file.

I have modified it slightly so that it can read column names from another file. Save the script below as dataExtractor.sh.

#!/bin/bash

DATAFILE=${1:-data.txt}
COLUMNFILE=${2:-list.txt}

awk -F, -v colsFile="$COLUMNFILE" '
   BEGIN {
     j=1
     while ((getline < colsFile) > 0) {
        col[j++] = $1
     }
     n=j-1;
     close(colsFile)
     for (i=1; i<=n; i++) s[col[i]]=i
   }
   NR==1 {
     for (f=1; f<=NF; f++)
       if ($f in s) c[s[$f]]=f
     next
   }
   { sep=""
     for (f=1; f<=n; f++) {
       printf("%c%s",sep,$c[f])
       sep=FS
     }
     print ""
   }
' "$DATAFILE"

Running it:

$ cat data.txt
ID,head1,head2,head3,head4
1,25.5,1364.0,22.5,13.2
2,10.1,215.56,1.15,22.2

$ cat list.txt
ID
head1
head4

$ dataExtractor.sh data.txt list.txt
1,25.5,13.2
2,10.1,22.2
dogbane
  • 266,786
  • 75
  • 396
  • 414
  • Thank you for your answer. I have used the more compact one below but now I see where I went wrong in my awk attempts! Very useful!! – lud Jun 19 '12 at 11:06
  • 2
    How about if we want to see ID, head1, head4 in the output as well? It appears that the ID, head1, head4 are gone in output section. – Sakura Nov 22 '17 at 10:11
4

I have an idea, but since I'm not experienced in shell programming (and don't know awk) this looks like reinventing some wheels in a ridiculous way:

$ cat DATA.TXT 
ID, head1, head2, head3, head4
1, 25.5, 1364.0, 22.5, 13.2
2, 10.1, 215.56, 1.15, 22.2

$ cat LIST.TXT 
head1
head4

$ cols=($(sed '1!d;s/, /\n/g' DATA.TXT | grep -nf LIST.TXT | sed 's/:.*$//'))

$ cut -d ',' -f 1$(printf ",%s" "${cols[@]}") DATA.TXT 
ID, head1, head4
1, 25.5, 13.2
2, 10.1, 22.2

P.S. I used some very basic ideas about bash arrays from this and this answers.

Community
  • 1
  • 1
Lev Levitsky
  • 63,701
  • 20
  • 147
  • 175
  • This works perfectly! Thanks. And I also have some material to think over it and understand what you did! – lud Jun 19 '12 at 11:04
  • @lud Glad to help. Feel free to ask if you have trouble understanding what I did there. (Hint: take it apart and see what happens, starting with `sed '1!d;s/, /\n/g' DATA.TXT` and so on) – Lev Levitsky Jun 19 '12 at 11:08
  • This script matches maximum first three numbers uniquely, for example, if headers in Data.txt file are like this: ID,10020,10022,10023,10024 list.txt 10020 10022 10023 So instead of printing till 10023, it will also print 10024, please check this issue – Waqas Khokhar Apr 15 '19 at 11:57
  • this does not seem to work when the colnames in LIST.TXT contain numbers and special characters (eg 34-0.0). Can it be adapted to work for these? – Powege Jul 10 '20 at 12:42
  • @Powege I would suspect that you need to add a -F before the -nf in the grep command to state that the strings as fixed, and not to interpret as regexps ("." has a special meaning, matching all characters in a regexp) – user36196 Nov 23 '21 at 10:42