7

I have a csv file with headers:

a,b,c,d,e,f,g,h

I would like to do something

cat abc.csv | sed "something to split them" | grep "e"  

#position of "e"

Can someone guide me how do I get the column idx of which header 'e' is at?

zx8754
  • 52,746
  • 12
  • 114
  • 209
aceminer
  • 4,089
  • 9
  • 56
  • 104

6 Answers6

14

Assuming your goal is to say "which column is this value in", you have a number of options, but this works:

sed -n $'1s/,/\\\n/gp' abc.csv | grep -nx 'e'
#output: 5:e

If you want to get just the number out of that:

sed -n $'1s/,/\\\n/gp' abc.csv | grep -nx 'e' | cut -d: -f1
#output: 5

Explanation:

Since the headers are on the first line of the file, we use the -n option to tell sed not to print out all the lines by default. We then give it an expression that starts with 1, meaning it is only executed on the first line, and ends with p, meaning that line gets printed out afterward.

The expression uses ANSI quotes ($'...') simply so it's easier to read: you can put a newline in it with \n instead of having to include a literal newline. Regardless, by the time the shell is done with it, the expression $'1s/,/\\\n/gp' gets passed to sed as 1s/,/\ /gp, which tells it to replace every comma on the first line with a newline and then print out the result. The output of just the sed on your example would be this:

a
b
c
d
e
f
g
h

(If your CSV file has many lines, you may want to add ;q to the end of the sed command so that it quits after that first line instead of continuing to read and do nothing with the rest of the lines.)

We then pipe that output through a grep command looking for e. We pass the -x option so that it only matches lines consisting of exactly 'e', not just any line containing an 'e' (Thanks @Marcel and @Sundeep), plus the -n option that tells it to include the line number of matching lines in its output. In the example, it outputs 5:e, where the 5: says that the rest of the output is from the 5th line of the input.

We can then pipe that through cut with a field delimiter (-d) of : to extract just the first field (-f1), which is the line number in the sed output - which is the field number in the original file.

Mark Reed
  • 91,912
  • 16
  • 138
  • 175
  • @aceminer, you need to use grep -n "^yourtext$" or you will have problems with substrings. "^" indicates the start of the line and "$" the end of the line. You can check my answer for more details. – Marcel Sep 30 '16 at 01:19
  • 1
    or use `-x` option – Sundeep Sep 30 '16 at 01:24
  • 2
    I didn't know -x @Sundeep. It's so beautifu!!! I have incorporated to my answer also. Thanks! – Marcel Sep 30 '16 at 01:30
  • there is also `-w` option to match whole words.. avoids needing `\b` in most cases... :) – Sundeep Sep 30 '16 at 01:37
  • another suggestion, (found in http://stackoverflow.com/questions/6958841/use-grep-to-report-back-only-line-numbers) - use `sed -n '/^e$/='` to get only line number.. avoids grep + cut – Sundeep Sep 30 '16 at 01:42
  • assumes the data looks like the example - no regexp metacharacters in the target field name and relies on non-POSIX sed and bash-only shell constructs, not a general solution. – Ed Morton Sep 30 '16 at 14:43
  • Point taken, @EdMorton. It still makes more of an attempt to meet the OP's stated intent instead of only working on their supplied sample data. – Mark Reed Oct 01 '16 at 14:58
7
  1. head is selecting the first line (header);
  2. tr is replacing the delimiter for line breaks;
  3. grep is selecting the line that contains exactly the string you want (substrings are ignored) and the line number is shown as well. In the example, we will have 5:e;
  4. cut is using ':' as delimiter and selecting the first column. So just the line number will be shown.

head -n1 abc.csv | tr "," "\n" | grep -nx e |  cut -d":" -f1

File content:

a,b,c,d,e,f,g,h

String that you want:

e

Output:

5
Marcel
  • 2,810
  • 2
  • 26
  • 46
  • Yeah, Mark. You are right. I have updated the answer with "head -1". Thank you very much. And +1 for your good answer. – Marcel Sep 30 '16 at 01:26
  • suggest to use `code` (the `{}` icon) to format instead of quote.. it will give syntax highlighting :) – Sundeep Sep 30 '16 at 01:44
  • Good point, @Sundeep. Actually, I tried but { } is not working for me now. Just this time o_O – Marcel Sep 30 '16 at 02:09
  • looks like it doesn't detect code formatting immediately following bullets or numbering... I added a dummy `` tag as workaround :D – Sundeep Sep 30 '16 at 02:24
  • nb: POSIXly correct `head` takes `-n 1` instead of just `-1` – Mark Reed Sep 30 '16 at 13:49
  • But the documentation says to use this manner: -n, --lines=[-]NUM I didn't get your point about POSIX, @MarkReed – Marcel Sep 30 '16 at 14:05
  • @MarcelJacquesMachado - the POSIX spec says that head takes the argument `-n count`, even though the traditional implementation took (and most still do) just `-`*count*. Some implementations (including, I believe, GNU coreutils, if a certain environment variable is set) strictly enforce the POSIX requirement and die with a usage error when passed `-`*num*. – Mark Reed Sep 30 '16 at 14:15
  • OK,@MarkReed. Good point. I have updated the answer. Thanks for sharing your knowledge =) – Marcel Sep 30 '16 at 14:23
2

This is a bit of hack, but it will give you the index of e:

head -n1 abc.csv | grep -oE '^.*(,|^)e(,|$)' | tr -Cd , | wc -c

It works by extracting the part of the top row up to the e, then it removes all characters except for the commas, and finally it counts the number of commas.

redneb
  • 21,794
  • 6
  • 42
  • 54
1
$ cat ip.txt 
a,b,c,d,e,f,g,h
1,2,3,4,5,6,7,8

With perl

$ # can also use: perl -F, -lane 'print grep {$F[$_-1] eq "e"} 1..$#F+1; exit'
$ perl -F, -lane 'foreach (0..$#F){ print $_+1 if $F[$_] eq "e" } exit' ip.txt 
5

With awk

$ awk -F, '{ for(i=1; i<=NF; i++) if($i=="e"){print i} exit}' ip.txt 
5

In both cases:

  • Input line is split on , and comparison made against each element to find matching index
  • No output if a match is not found
  • exit is used to quit immediately after reading first line
Sundeep
  • 23,246
  • 2
  • 28
  • 103
1

Following command will loop through all the fields and check for the string "e". If Found then print the index of that location.

For all the lines:

awk -F, '{for(i=1;i<=NF;i++) if($i=="e") print i}' input.csv

IF it's limited to header/ first row then :

awk -F, 'NR==1{for(i=1;i<=NF;i++) if($i=="e") {print i;exit}}' input.csv

Example:

echo "a,b,c,d,e,f,g,h" |awk -F, '{for(i=1;i<=NF;i++) if($i=="e") {print i}'
5
P....
  • 17,421
  • 2
  • 32
  • 52
  • Real file might have many lines; should probably put an `(NR==1)` guard on that expression and turn the `print i` into `{ print i; exit }`. – Mark Reed Sep 30 '16 at 13:43
  • @MarkReed : hey mark, but that would make solution limited to first row only. Why would we would limit it ? – P.... Sep 30 '16 at 13:49
  • OP is looking for "which column number is this header". Headers are the first row. – Mark Reed Sep 30 '16 at 13:50
0
$ awk '{print (index($0,"e")+1)/2}' file
5
Ed Morton
  • 188,023
  • 17
  • 78
  • 185
  • assumes the data looks like the example - one character per field, only one line. not a general solution. – Mark Reed Sep 30 '16 at 13:48
  • Absolutely correct, it addresses the OPs question as all of our solutions do. Yours, for example, would fail if the target field name contained RE metacharacters. So why the downvote? – Ed Morton Sep 30 '16 at 13:53