1
year start  year end    location    topic   data type   data value
2016    2017    AL  Alcohol Crude Prevalence    16.9
2016    2017    CA  Alcohol Other   15
2016    2017    AZ  Neuropathy  Other   13.1
2016    2017    HI  Smoke   Crude Prevalence    20
2016    2017    IL  Cancer  Other   20
2016    2017    KS  Cancer  Other   14
2016    2017    AZ  Smoke   Crude Prevalence    16.9
2016    2017    KY  Cancer  Other   13.8
2016    2017    LA  Alcohol Crude Prevalence    18

The answer is required to count lines which are associated with the disease “topic”s "Alcohol" and "Cancer".

I already got the index of column named as "topic" , but the contents I am going to extract from "topic" is not correct, then I am not able to count the lines which is containing "Alcohol" and "Cancer", how to solve it?

Here is my code:

awk '{print $4}' AAA.csv > topic.txt
head -n5 topic.txt | less
Jennifer Chen
  • 71
  • 1
  • 7

2 Answers2

1

You could try the following:

the call to awk gets the column in question, the grep filters the keywords, and the word count counts the lines

$ awk '{ print $4 }' data.txt | grep -e Alcohol -e Cancer | wc -l
6
Slawomir Chodnicki
  • 1,525
  • 11
  • 16
  • Thank you! do you know why the "less" window shows some contents that seem to be different from the columns? – Jennifer Chen May 10 '19 at 07:24
  • I'm not sure I understand the question, could you clarify showing the command output you see vs. what you expect? – Slawomir Chodnicki May 10 '19 at 07:38
  • for example: values for "topic" column should be just "Alcohol", "Cancer" etc. However, the values I extract by using ```awk '{ print $4 }' data.txt | grep -e Alcohol -e Cancer | less ```, values are different, even some are same as values from other columns. – Jennifer Chen May 10 '19 at 08:48
  • It most likely means that your file is a tab-separated file with white-spaces being possible in the payload. Do you have a format specification for your file? – Slawomir Chodnicki May 10 '19 at 09:11
  • yes, that's what I mean, it is CSV file. so is there any solution? – Jennifer Chen May 10 '19 at 10:10
  • the CSV format is generally hard to parse correctly on shell if it contains enclosure characters etc. If I had to extract a CSV column on the shell, I'd probably look into existing tools like suggested here: https://stackoverflow.com/questions/4286469/how-to-parse-a-csv-file-in-bash – Slawomir Chodnicki May 10 '19 at 10:18
  • so it is just a display issue or? I mean, shall we extract the wrong line if we leave the enclosure characters there? – Jennifer Chen May 10 '19 at 10:29
1

Using a regexp with grep:

cat data.txt|tr -s " "|cut -d " " -f 4|grep -E '(Alcohol|Cancer)'|wc -l

If you are sure that words "Alcohol" and "Cancer" only appear in the 4th column you can just do

grep -E '(Alcohol|Cancer)' data.txt|wc -l

Addition

The OP asks in the comment:

If there are many columns, and I don't know the index of them. How can I extract the columns just based on their name ("topic")?

This code will store in the variable i the column containing "topic". Essentially, the code stores the first line of data.txt as an array variable s, and then parse the array elements until it finds the desired word. (You have to increase i by one at the end because the array index starts from 0).

Note: the code works only if actually a column "topic" is found.

head -n 1 data.txt|read -a s
for (( i=0; i<${#s[@]}; i++ ))
do
  if [ "${s[$i]}" == "topic" ]
  then
     break
  fi
done
i=$(( $i + 1 ))
francesco
  • 7,189
  • 7
  • 22
  • 49
  • Not sure those patterns only appear in 4th column, so that's why I have to select "topic" column firstly, then count the lines. If there are many columns, and I don't know the index of them. How can I extract the columns just based on their name ("topic")? Thank you! – Jennifer Chen May 10 '19 at 07:32