30

I have a data frame with >100 columns each labeled with a unique string. Column 1 represents the index variable. I would like to use a basic UNIX command to extract the index column (column 1) + a specific column string using grep.

For example, if my data frame looks like the following:

Index  A  B  C...D  E  F
p1     1  7  4   2  5  6
p2     2  2  1   2  .  3
p3     3  3  1   5  6  1

I would like to use some command to extract only column "X" which I will specify with grep, and display both column 1 & the column I grep'd. I know that I can use cut -f1 myfile for the first bit, but need help with the grep per column. As a more concrete example, if my grep phrase were "B", I would like the output to be:

Index  B
p1     7
p2     2
p3     3

I am new to UNIX, and have not found much in similar examples. Any help would be much appreciated!!

agc
  • 7,973
  • 2
  • 29
  • 50
AMS
  • 494
  • 1
  • 5
  • 12

3 Answers3

43

You need to use awk:

awk '{print $1,$3}' <namefile>

This simple command allows printing the first ($1) and third ($3) column of the file. The software awk is actually much more powerful. I think you should have a look at the man page of awk.

A nice combo is using grep and awk with a pipe. The following code will print column 1 and 3 of only the lines of your file that contain 'p1':

grep 'p1' <namefile> | awk '{print $1,$3}'

If, instead, you want to select lines by line number you can replace grep with sed:

sed 1p <namefile> | awk '{print $1,$3}'

Actually, awk can be used alone in all the examples:

awk '/p1/{print $1,$3}' <namefile> # will print only lines containing p1
awk '{if(NR == 1){print $1,$3}}' <namefile> # Will print only first line
Benjamin Loison
  • 3,782
  • 4
  • 16
  • 33
Riccardo Petraglia
  • 1,943
  • 1
  • 13
  • 25
  • 2
    Thanks! This is very helpful. Can awk be used simply to return columns with grep when you do not know the column number but instead know the column name? – AMS Sep 17 '16 at 21:08
  • @Riccardo Petraglia: For you solution ```grep 'p1' | awk '{print $1,$3}'``` what if I need only the first three characters of $3? – anishjp Sep 29 '20 at 16:41
  • @anishjp https://stackoverflow.com/questions/1405611/how-to-extract-the-first-two-characters-of-a-string-in-shell-scripting – Riccardo Petraglia Sep 29 '20 at 17:08
  • I think this is not addressing the question as OP wants to extract the column using the column name rather than the column position. – al-ash Sep 01 '22 at 05:25
  • 1
    @RiccardoPetraglia OP uses a hypothetical "column-wise grep" to explain what she/he wants to achieve. Despite the confusion, with the help of the example, OP makes the question imho clear enough. – al-ash Sep 02 '22 at 06:00
11

First figure out the command to find the column number.

columnname=C
sed -n "1 s/${columnname}.*//p" datafile | sed 's/[^\t*]//g' | wc -c

Once you know the number, use cut

cut -f1,3 < datafile 

Combine into one command

cut -f1,$(sed -n "1 s/${columnname}.*//p" datafile | 
   sed 's/[^\t*]//g' | wc -c) < datafile

Finished? No, you should improve the first sed command when one header can be a substring of another header: include tabs in your match and put the tabs back in the replacement string.

Benjamin Loison
  • 3,782
  • 4
  • 16
  • 33
Walter A
  • 19,067
  • 2
  • 23
  • 43
  • Thank you so much. Having trouble with this command such that the first sed... outputs "1" no matter how I set my column name? – AMS Sep 17 '16 at 22:02
  • Did you set columnname to the name of one of the headers? When the part until the first pipe shows the header line without anything with the columnname taht past is working. Now look to the second `sed`. Are your fields seperated with tabs? When you have a variable number of spaces, it will be hard to recognise them between header fields with a space inside, like `this one` or `nr of errors`. – Walter A Sep 18 '16 at 16:15
0

If you want to retain the first column and a column which contains a specific string in the first row (eg. B), then this should work. It assumes that your string is only present once though.

awk '{if(NR==1){c=0;for(i=1;i<=NF;i++){c++;if($i=="B"){n=c}}}; print $1,$n}' myfile.txt

There is probably a better solution with amazing awk, but this should work.

EXPLANATION: In the first row (NR==1), it iterates through all columns for(i=1;i<=NF;i++) until it finds the string, save the column number and then prints it. If you want to pass the string as a variable then you can use the -v option.

Benjamin Loison
  • 3,782
  • 4
  • 16
  • 33