Find the unique values in a column and replace the unique values with numbers

Question

I have tab-delimited data that reads:

1 0 0 1 1 Black Swan
0 0 1 0 0 Golden Duck
1 0 0 1 0 Brown Eagle
0 0 1 0 1 Golden Duck
1 0 0 1 0 Black Swan
1 0 1 0 0 Golden Duck
1 0 0 1 1 Sparrow

The last column is a combination of one or more words separated by spaces. I want to count the number of unique values in the last column and replace each value with a number that is unique to its group. I know I can count the unique values using

awk -F '\t' '{print $NF}'  infile | sort | uniq | wc -l

But how do I replace them with numbers? For example, replace every Black Swan with 1, every Golden Duck with 2, etc. I want the result to be:

1 0 0 1 1 1
0 0 1 0 0 2
1 0 0 1 0 3
0 0 1 0 1 2
1 0 0 1 0 1
1 0 1 0 0 2
1 0 0 1 1 4

I also want to generate the list of numbers assigned to each value, like:

Black Swan 1
Golden Duck 2
Brown Eagle 3
Sparrow 4

score 5 · Accepted Answer · answered May 09 '14 at 14:03

You can use an associative array to assign an incrementing counter to each distinct name:

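awk '
    BEGIN {
        FS = OFS = "\t"   # read and write tab-separated fields
        i = 0
    }
    {
        # First time this name appears: give it the next number.
        if (! names[$NF]) {
            names[$NF] = ++i
        }
        # Replace the last field with the number for its name.
        $NF = names[$NF]
        print $0
    }
    END {
        # Print the name-to-number mapping (iteration order is unspecified).
        for (name in names) {
            printf "%s %d\n", name, names[name]
        }
    }
' infile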

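It yields the following. The mapping lines at the end come out in no particular order, because awk does not define the iteration order of `for (name in names)`: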

1       0       0       1       1       1
0       0       1       0       0       2
1       0       0       1       0       3
0       0       1       0       1       2
1       0       0       1       0       1
1       0       1       0       0       2
1       0       0       1       1       4
Golden Duck 2
Brown Eagle 3
Sparrow 4
Black Swan 1
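
If the mapping should come out in the order the numbers were assigned (as in the question's expected list), one option is to keep a second array indexed by number. This is only a sketch: the function name `number_groups` and the array names `id` and `name` are made up for illustration, and it assumes tab-separated input as above.

```shell
# Hypothetical helper: number the distinct values of the last column
# and print the mapping in the order the numbers were handed out.
number_groups() {
    awk '
        BEGIN { FS = OFS = "\t" }
        {
            # New value: assign the next number and remember its position.
            if (!($NF in id)) {
                id[$NF] = ++n
                name[n] = $NF
            }
            $NF = id[$NF]
            print
        }
        END {
            # Walk the numbers 1..n so the order is deterministic.
            for (k = 1; k <= n; k++)
                print name[k], k
        }
    ' "$@"
}
```

It can be called as `number_groups infile`, or fed from a pipe. The mapping lines are tab-separated here because `print` uses `OFS`.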

Agreed. No need to init `i` obviously, and the printf at the end could just be a print but nbd. — Ed Morton, May 09 '14 at 15:30

score 4 · Answer 2 · edited May 23 '17 at 11:49

I started writing this so I'll finish:

awk '
BEGIN {FS = OFS = "\t"}
{
    last[$NF] = (last[$NF] ? last[$NF] : ++cnt)
    $NF = last[$NF]
    line[NR] = $0
}
END {
    for (nr=1; nr<=NR; nr++) 
        print line[nr]
    for (name in last) 
        print name, last[name]
}' file

Output:

1       0       0       1       1       1
0       0       1       0       0       2
1       0       0       1       0       3
0       0       1       0       1       2
1       0       0       1       0       1
1       0       1       0       0       2
1       0       0       1       1       4
Brown Eagle     3
Black Swan      1
Sparrow         4
Golden Duck     2
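
Both awk versions above send the mapping to the same stream as the renumbered data. If the two should be kept apart, a possible variation writes each mapping line to a separate file the first time a name is seen, which also avoids holding the lines in memory. This is a sketch under assumptions: the function name `split_output` and the awk variable `map` are invented for the example.

```shell
# Hypothetical helper: renumbered data goes to stdout,
# the name-to-number mapping goes to the file given as first argument.
split_output() {
    map_file=$1
    shift
    awk -v map="$map_file" '
        BEGIN { FS = OFS = "\t" }
        {
            if (!($NF in id)) {
                id[$NF] = ++n
                # Record the mapping as soon as the number is assigned.
                print $NF, n > map
            }
            $NF = id[$NF]
            print
        }
    ' "$@"
}
```

For example, `split_output mapping.txt file > renumbered.txt` would leave the data in `renumbered.txt` and the list in `mapping.txt` (both file names are placeholders).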

Update:

Here is a Perl alternative:

perl -F'\t' -lane '
    $h{$F[-1]} = ++$c unless exists $h{$F[-1]}; 
    $F[-1] = $h{$F[-1]}; 
    print join "\t", @F }{ print "$_  $h{$_}" for keys %h
' file

Output:

1       0       0       1       1       1
0       0       1       0       0       2
1       0       0       1       0       3
0       0       1       0       1       2
1       0       0       1       0       1
1       0       1       0       0       2
1       0       0       1       1       4
Golden Duck  2
Brown Eagle  3
Black Swan  1
Sparrow  4

Here is another update based on mpapec's excellent comment:

perl -F'\t' -lane '
    $F[-1] = $h{$F[-1]} ||= ++$c; 
    print join "\t", @F }{ print "$_  $h{$_}" for keys %h
' file

+1, just `$h{$F[-1]} = $h{$F[-1]} ? $h{$F[-1]} : ++$c;` can be written as `$h{$F[-1]} = $h{$F[-1]} || ++$c;` or `$h{$F[-1]} ||= ++$c;` for short, and `splice @F, -1, 1, $h{$F[-1]};` as `$F[-1] = $h{$F[-1]}`. For **golfing only** purposes that can be further shortened `$F[-1] = $h{$F[-1]} ||= ++$c;` — mpapec, May 09 '14 at 16:36

score 1 · Answer 3 · edited May 23 '17 at 12:12

What you want is to build a set of the unique values in the last column. A set can be implemented as a dictionary (hash table) whose keys are all distinct. Once you have the set, assign each element a number, then go through the data and replace each string with its number.

Here is a link on sets in Perl to help you out:

http://world.std.com/~swmcd/steven/perl/pm/set.html
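
The build-then-replace idea can be sketched in awk as two passes over the same file, with an associative array standing in for the set. The function name `two_pass_ids` is hypothetical, and the sketch assumes a non-empty regular file (not a pipe), since it is read twice.

```shell
# Hypothetical helper: pass 1 builds the set, pass 2 does the replacement.
two_pass_ids() {
    awk '
        BEGIN { FS = OFS = "\t" }
        # Pass 1 (first read of the file): collect distinct names,
        # numbering each one the first time it is seen.
        NR == FNR {
            if (!($NF in id))
                id[$NF] = ++n
            next
        }
        # Pass 2 (second read): look each name up and replace it.
        {
            $NF = id[$NF]
            print
        }
    ' "$1" "$1"
}
```

Usage would be `two_pass_ids infile`. In practice the single-pass versions in the other answers do the same work with one read; the two-pass form just mirrors the description above.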

answered May 09 '14 at 14:03 by Josh
