Grouping the data into categories based on a column

Question

I have a tab-delimited file with two columns:

new.txt
    1.01   yes
    2.00   no
    0.93   no
    1.2223 yes
    1.7211 no

I want to regroup its contents so that each category becomes its own column, like this:

new_categorized.txt
yes    no
1.01   2.00
1.2223 0.93
       1.7211

I found a similar question with an answer in R (here); however, I need to do it with bash or awk. I would appreciate your help.

score 3 · Accepted Answer · answered Jan 14 '18 at 18:44

$ cat tst.awk
BEGIN { FS=OFS="\t" }
!($2 in label2colNr) {
    label2colNr[$2] = ++numCols
    colNr2label[numCols] = $2
}
{
    colNr = label2colNr[$2]
    val[++numRows[colNr],colNr] = $1
    maxRows = (numRows[colNr] > maxRows ? numRows[colNr] : maxRows)
}
END {
    for (colNr=1; colNr <= numCols; colNr++) {
        printf "%s%s", colNr2label[colNr], (colNr<numCols ? OFS : ORS)
    }

    for (rowNr=1; rowNr <= maxRows; rowNr++) {
        for (colNr=1; colNr <= numCols; colNr++) {
            printf "%s%s", val[rowNr,colNr], (colNr<numCols ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk file
yes     no
1.01    2.00
1.2223  0.93
        1.7211

The above will work with any awk in any shell on any UNIX system no matter how many categories you have in the 2nd field and no matter what their values are.
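
As a quick check of that claim, here is a condensed sketch of the same approach run against a hypothetical three-label input. The `maybe` label, the temp file, and the shortened variable names are assumptions for illustration and are not part of the original question.

```shell
# Hypothetical three-label, tab-delimited input written to a temp file.
tmp=$(mktemp)
printf '1.01\tyes\n2.00\tno\n0.5\tmaybe\n0.93\tno\n1.2223\tyes\n' > "$tmp"

result=$(awk '
BEGIN { FS = OFS = "\t" }
# First time a label is seen, assign it the next output column.
!($2 in col) { col[$2] = ++n; label[n] = $2 }
{
    c = col[$2]
    val[++rows[c], c] = $1            # append the value to its column
    if (rows[c] > max) max = rows[c]  # track the longest column
}
END {
    for (c = 1; c <= n; c++) printf "%s%s", label[c], (c < n ? OFS : ORS)
    for (r = 1; r <= max; r++)
        for (c = 1; c <= n; c++) printf "%s%s", val[r, c], (c < n ? OFS : ORS)
}' "$tmp")
rm -f "$tmp"
printf '%s\n' "$result"
```

Columns come out in the order in which each label first appears in the input (here yes, no, maybe), and shorter columns are padded with empty fields.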

Ed Morton

Thank you. I tried it, but it gave me new.txt unchanged instead of the categories. – bapors Jan 14 '18 at 18:59
That's simply impossible. You must've copy/pasted wrong. – Ed Morton Jan 14 '18 at 19:00
Sorry. You are right, I had a typo. Your answer works – bapors Jan 14 '18 at 19:02
@EdMorton: I recycled your code right there: https://stackoverflow.com/a/48253326/3776858 – Cyrus Jan 14 '18 at 19:28
@Cyrus Glad to hear it. It should really be a FAQ :-) . – Ed Morton Jan 15 '18 at 01:40

score 2 · Answer 2 · answered Jan 14 '18 at 18:43

With bash, GNU grep and paste:

echo -e "yes\tno"
paste <(grep -Po '^\t\K.*(?=\tyes)' new.txt) <(grep -Po '^\t\K.*(?=\tno)' new.txt)

Output:

yes     no
1.01    2.00
1.2223  0.93
        1.7211

Cyrus

Thank you for the answer. How would you redirect this output to another file? – bapors Jan 14 '18 at 18:56
`{ echo ...; paste ... ; } > file.txt`. Last `;` is important. – Cyrus Jan 14 '18 at 19:03
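
The brace grouping from the comment above can be sketched as follows. The two `printf` commands are stand-ins for the `echo` and `paste` commands in the answer, and the temp directory is an assumption made so the example leaves no files behind.

```shell
# Hypothetical output location for this sketch.
dir=$(mktemp -d)

# A brace group shares one redirection across all commands inside it.
# The ";" before the closing "}" is required when the group is on one line.
{ printf 'yes\tno\n'; printf '1.01\t2.00\n'; } > "$dir/new_categorized.txt"

result=$(cat "$dir/new_categorized.txt")
rm -rf "$dir"
printf '%s\n' "$result"
```

Both lines end up in the same file because the redirection applies to the group as a whole, not only to the last command.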

score 2 · Answer 3 · answered Jan 14 '18 at 18:58

GNU awk solution:

awk '{ a[$2][($2=="yes"? ++y : ++n)]=$1 }
     END{ 
         max=(y > n? y:n); 
         print "yes","no";
         for(i=1; i<=max; i++) print a["yes"][i], a["no"][i] 
     }' OFS='\t' file | column -tn

The output:

yes     no
1.01    2.00
1.2223  0.93
        1.7211

RomanPerekhrest

Thank you, but it complained: awk: line 1: syntax error at or near [ awk: line 5: syntax error at or near [ – bapors Jan 14 '18 at 19:01
@bapors, 1) check your awk version; 2) check your code for errors. Here's a screenshot https://ibb.co/c5AEwm – RomanPerekhrest Jan 14 '18 at 19:04
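
For reference, the `a[$2][...]` syntax is a GNU awk extension (arrays of arrays, gawk 4.0 and later), which is a likely cause of a syntax error at `[` in other awks. Below is a minimal sketch of the same idea using POSIX-style `a[label, i]` subscripts. It still assumes only the two labels yes and no, and the temp file and the `cnt` array are names introduced here for illustration.

```shell
# Sample input from the question, written to a temp file.
tmp=$(mktemp)
printf '1.01\tyes\n2.00\tno\n0.93\tno\n1.2223\tyes\n1.7211\tno\n' > "$tmp"

result=$(awk -F'\t' -v OFS='\t' '
# Key is "label SUBSEP index"; cnt[label] counts values seen per label.
{ a[$2, ++cnt[$2]] = $1 }
END {
    max = (cnt["yes"] > cnt["no"] ? cnt["yes"] : cnt["no"])
    print "yes", "no"
    for (i = 1; i <= max; i++) print a["yes", i], a["no", i]
}' "$tmp")
rm -f "$tmp"
printf '%s\n' "$result"
```

This version does not pipe through `column`, so the output is plain tab-separated text.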
