I have a tab seperated text file A.tsv of icd code matrix which the columns are patient id and icd codes, rows are observations for each patients. NA indicate the patient are not diagnosed as the icd codes
study_id 691.8 692.9 701.2 706.1
a1 1 NA NA 2
a2 NA NA NA NA
a3 NA NA 1 NA
and a icd_code file consisting of the icd codes of interest
691.8 ICD_9
706.1 ICD_10
For a patient, if any icd codes of interest have a value (not NA), the diagnosis is coded as 1. If all icd codes of interest are NA, the diagnosis is coded as 0.
For the above example, the output should be
study_id diagnosis
a1 1
a2 0
a3 0
For the above example, the output should be
study_id diagnosis
a1 1
a2 0
a3 0
I am new to bash script and I have no clue on where should I start. How could I write a bash script with awk to realize the above question?
Now I have a solution but it seems that it is summarizing all columns but not specific icd code of interest listed in the icd file
awk -F"\t" 'BEGIN { OFS="\t"; } NR==FNR { icd_codes[$1] = $2; next; } FNR > 1 { study_id = $1; diagnosis = 0; for (i = 2; i <= NF; i++) { if ($i != "NA" && icd_codes[$i] != "") { diagnosis = 1; break; } } print study_id, diagnosis; }' "$icd_file" "$input_file" > "$output_file"