Using awk to identify disease for patients

Question

I have a tab-separated text file, A.tsv, containing an ICD code matrix. The columns are the patient ID followed by ICD codes, and each row holds the observations for one patient. NA indicates that the patient has not been diagnosed with that ICD code.

study_id 691.8 692.9 701.2 706.1
a1       1     NA    NA    2
a2       NA    NA    NA    NA
a3       NA    NA    1    NA

and an icd_code file listing the ICD codes of interest:

691.8 ICD_9
706.1 ICD_10

For a patient, if any ICD code of interest has a value (not NA), the diagnosis is coded as 1. If all ICD codes of interest are NA, the diagnosis is coded as 0.

For the above example, the output should be

study_id diagnosis
a1       1
a2       0
a3       0

I am new to Bash scripting and have no idea where to start. How could I write a Bash script with awk to do this?

I now have a solution, but it seems to summarize all columns rather than only the ICD codes of interest listed in the ICD file:

awk -F"\t" 'BEGIN { OFS="\t"; } NR==FNR { icd_codes[$1] = $2; next; } FNR > 1 { study_id = $1; diagnosis = 0; for (i = 2; i <= NF; i++) { if ($i != "NA" && icd_codes[$i] != "") { diagnosis = 1; break; } } print study_id, diagnosis; }' "$icd_file" "$input_file" > "$output_file"

Stackoverflow is not a free coding service. It is a website where you can ask questions about your **own** code and get help about it. Please show your code and explain what is wrong with it. — Renaud Pacalet, Jul 15 '23 at 09:57
It is open now, so please shift your answer to the answer section. — Rohit Gupta, Jul 16 '23 at 12:37

Ed Morton · Answer 1 · 2023-07-16T13:25:58.517

The main problem in your script is that icd_codes[$i] != "" should use the column header string, e.g. 691.8, as the array index, but it instead uses the current value of the cell in that column, e.g. 1 or NA. You need an additional array that maps column numbers to column header strings or, as I've done below, column header strings to column numbers, which is more efficient because it uses fewer loop iterations per input line.

Using any awk:

$ cat tst.awk
BEGIN { OFS="\t" }
NR==FNR {
    tgtIcds[$1]
    next
}
FNR == 1 {
    for ( fldNr=2; fldNr<=NF; fldNr++ ) {
        icd = $fldNr
        if ( icd in tgtIcds ) {
            icds2fldNrs[icd] = fldNr
        }
    }
    diag = "diagnosis"
}
FNR > 1 {
    diag = 0
    for ( icd in icds2fldNrs ) {
        fldNr = icds2fldNrs[icd]
        if ( $fldNr != "NA" ) {
            diag = 1
            break
        }
    }
}
{ print $1, diag }

$ awk -f tst.awk icd_file A.tsv
study_id        diagnosis
a1      1
a2      0
a3      0

$ awk -f tst.awk icd_file A.tsv | column -t
study_id  diagnosis
a1        1
a2        0
a3        0
