-2

I have .csv file, I want to count total values from column 5 only if corresponding value of column8 is not equal to "999" I have tried with this, but not getting desired output.

cat test.csv | sed "1 d" |awk -F , '$8 != 999' | cut -d, -f5  | sort | uniq | wc -l >test.txt

Note that total number of records is more than 20K I do am getting the number of unique values but it is not subtracting values with 999. Can anyone help?

Sample Input:

Col1,Col2,Col3,Col4,Col5,Col7,Col8,Col9,Col10,Col11
1,0,0,0,ABCD,0,0,5436,0,0,0
1,0,0,0,543674,0,0,18999,0,0,0
1,0,0,0,143527,0,0,1336,0,0,0
1,0,0,0,4325,0,0,999,0,0,0
1,0,0,0,MCCDU,0,0,456,0,0,0
1,0,0,0,MCCDU,0,0,190,0,0,0
1,0,0,0,4325,0,0,190,0,0,0

What I want to do is not count the value from col5 if the corresponding value from col8 ==999.

By count total I mean total lines. In above sample input, col5 value of line 6 and 7 is same, I need them to count as one. I need to sort as Col5 values could be duplicate as I need to find total unique values.

beginner
  • 1
  • 2
  • 1
    Seems to work with a three-line demo file I put together. Can you share a sample input and explain what result you're getting for it vs the result you'd expect? – Mureinik Sep 17 '21 at 08:56
  • We need more information. Please, edit your question to include a sample of your input file, at least several complete lines, without ambiguous strings like "-------" . Please include also the output you want, corresponding to the provided input. And finally explain what you understand with "count total": is this the sum of the values of the 5th field, or the amount of lines containing it? Last but not least, why do you need to use `sort` and `uniq`? It is not clear from the title of the question. – Pierre François Sep 17 '21 at 09:10
  • edited as requested – beginner Sep 17 '21 at 09:24
  • Are you trying to _actually_ say that if a value exists with 999 in the 8th column, that value should not be included among the unique values, even if it also occurs on lines where the 8th column is not 999? If not, can you explain in more detail why the result 6 is wrong and why you expect a different result? – tripleee Sep 17 '21 at 09:35
  • 1
    As an aside, this would be trivial to refactor to pure Awk. – tripleee Sep 17 '21 at 09:35
  • Hi, There are two things to check, First: value in Col8 should not be 999. Second: value in col5 should not be duplicate, as in if there are same values in col5 for multiple records , it should count as 1 only, As in the sample input there is one value which is duplicate -MCCDU(for col5), my desired output here will be 5 – beginner Sep 17 '21 at 09:55
  • Your updated code prints 5 just like you wanted it to. You can verify that it only prints exactly the values you wanted by removing `wc -l` from the end. – tripleee Sep 17 '21 at 10:43
  • As an aside, please don't edit your question in such a way as to invalidate existing answers. – tripleee Sep 17 '21 at 10:43
  • 1
    Regarding `What I want to do is not count the value from col5 if the corresponding value from col8 ==999` - no, that's not what you want to do. What you want to do is count the value from col5 if the corresponding value from col8 !=999. They're different requirements as the the first statement does not mean that you do need to count lines where $8 is 27, for example. It's like the difference between your requirements being "I must not eat apples" vs "I must eat everything that is not an apple" and being handed a dead skunk. – Ed Morton Sep 17 '21 at 14:00

2 Answers2

2

Script:

awk 'BEGIN {FS=","} NR > 1 && $8 != 999 {uniq[$5]++} END {for(key in uniq) {total+=uniq[key]}; print "Total: "total}' input.csv

Output:

543674 1
143527 1
ABCD 1
MCCDU 2
4325 1
Total: 6
McLeon
  • 66
  • 3
  • This still results in a total count of 6, which the OP seems to think is incorrect. Vague points for showing how to refactor their code, but that was not the question here. – tripleee Sep 17 '21 at 09:49
  • The OP wants to count the total unqie values, you count the total values as you do `total+=uniq[key]`. Here is the same thing, with correction: `awk 'BEGIN{FS=","}($8!=999){a[$5]}END{for(i in a) t++; print t}' file.csv` – kvantour Sep 17 '21 at 12:58
  • 1
    `END {print length(a)}` works with gawk and mawk – glenn jackman Sep 17 '21 at 13:30
0

With an awk that supports length(array) (e.g. GNU awk and some others):

$ awk -F',' '(NR>1) && ($8!=999){vals[$5]} END{print length(vals)}' test.csv
5

With any awk:

$ awk -F',' '(NR>1) && ($8!=999) && !seen[$5]++{ cnt++ } END{print cnt+0}' test.csv
5

The +0 in the END is so you get numeric output even if the input file is empty.

Ed Morton
  • 188,023
  • 17
  • 78
  • 185