count, groupby with sed, or awk

Question

I want to perform two different sort-and-count operations on a file, based on each line's content. I need to take the first column of a .tsv file and group the lines that start with three digits, keeping only the first three digits; for everything else, I just want to sort and count the occurrences of the whole value in the first column.

Sample data:

687/878 9
890987  4
01a 55
1b  8743917
890a    34
abcdee  987
dfeqfe  fkdjald
890897  34213
6878853 834
32fasd  53891
abcdee  8794371
abd 873

result:

687 2
890 3
01a 1
1b  1
32fasd  1
abd 1
dfeqfe  1
abcdee  2

I would also appreciate a solution that takes into account a sample input like:

687/878 9
890987  4
01a     55
1b      8743917
890a    34
abcdee  987
dfeqfe  545
890897  34213
6878853 834
(632)fasd  53891
(88)abcdee  8794371
abd     873

So the first column may contain characters such as (, ), #, ' and so on: all kinds of characters.

The output will have two columns: the first with the extracted values, and the second with the count of each extracted value in the source file.

Again, the preferred output format is TSV.

So I need to extract all values that start with ^\d\d\d and then, for those first three digits, sort and count the unique values,

but in a second pass, do the same for each line that does not start with three digits, this time keeping the whole column value and sorting and counting by it.

What I have tried: `| sort | uniq -c | sort -nr` for the lines that start with ^\d\d\d, and

the same for those that do not match the regex above. Is there a more elegant way using either sed or awk?

Possible duplicate of [Awk/Unix group by](https://stackoverflow.com/questions/14916826/awk-unix-group-by) — tripleee, Jan 07 '19 at 12:21
Why is this more complex? I only see one level of groups. Briefly, `awk -F '\t' '/^[0-9]{3}/ { a[substr($1,1,3)]++; next } { a[$1]++ } END { etc etc }'` — tripleee, Jan 07 '19 at 13:24
there are two groups, one that has the values ^[0-9]{3}, and one that has the whole values of all the rest lines, that do not match ^[0-9]{3} — , Jan 07 '19 at 13:28
then you shouldn't have accepted the first answer you got, as that discourages people from providing additional answers since you may not even be reading them any longer. — Ed Morton, Jan 07 '19 at 14:49
Those groups are in the same array; just print it in the `END` block. — tripleee, Jan 07 '19 at 14:56
I see no requirement for separate sorting. The fact that Ed's answer is accepted suggests that this is acceptable. If there are additional requirements, probably a new question would be the way to go at this point. — tripleee, Jan 07 '19 at 15:38
@tripleee Two passages from OP's description, starting from `so i need to` -- Near the end. — Til, Jan 07 '19 at 15:51
I still don't see that this is a clear requirement. Feel free to post a new question if you need help figuring this out. — tripleee, Jan 07 '19 at 16:28
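
The one-liner in tripleee's comment leaves the `END` block as `etc etc`. A minimal sketch of how it might be completed is shown here, assuming tab-separated input. The function name `count_keys` is hypothetical, and `[0-9][0-9][0-9]` is swapped in for `[0-9]{3}` only because some older awk builds do not support interval expressions.

```shell
# count_keys: hypothetical helper name, not from the thread.
# Reads TSV from stdin or from files; prints "key<TAB>count" in unspecified order.
count_keys() {
    awk -F '\t' -v OFS='\t' '
        # Line starts with three digits: count under those three digits.
        /^[0-9][0-9][0-9]/ { a[substr($1, 1, 3)]++; next }
        # Any other line: count under the whole first column.
        { a[$1]++ }
        # Both kinds of key live in the same array, so one loop prints them all.
        END { for (k in a) print k, a[k] }
    ' "$@"
}

# Example with four lines in the style of the sample data.
printf '687/878\t9\n890987\t4\n890a\t34\nabcdee\t987\n' | count_keys | LC_ALL=C sort
```

The `for (k in a)` loop visits keys in no particular order, which is why the example pipes the result through `sort`.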

score 2 · Accepted Answer · answered Jan 07 '19 at 14:45

$ cat tst.awk
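BEGIN { FS=OFS="\t" }
# Key is the first three characters when the line starts with three digits,
# otherwise the whole first column.
{ cnt[/^[0-9]{3}/ ? substr($1,1,3) : $1]++ }
END {
    for (key in cnt) {
        # Prefix a group flag (0 = three-digit key, 1 = other) and the count
        # so the output can be sorted; cut strips both columns afterwards.
        print (key !~ /^[0-9]{3}/), cnt[key], key, cnt[key]
    }
}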

$ awk -f tst.awk file | sort -k1,1n -k2,2n | cut -f3-
687     1
890     2
abcdee  1
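
The answer above follows a decorate, sort, undecorate pattern: awk prepends sort keys, `sort` orders the lines, and `cut` removes the helper columns. A hedged variant is sketched here that orders each group by count, highest first, which is what the asker requests in the comments. The function name `grouped_counts`, the descending order, and the tie-break on the key are assumptions, not part of the answer.

```shell
# grouped_counts: hypothetical helper name, not from the answer.
# Step 1 (decorate): awk prints "flag<TAB>key<TAB>count", flag 0 = three-digit key, 1 = other.
# Step 2 (sort): by flag, then by count (highest first), then by key to break ties.
# Step 3 (undecorate): cut drops the flag column, leaving "key<TAB>count".
grouped_counts() {
    awk -F '\t' -v OFS='\t' '
        { key = ($1 ~ /^[0-9][0-9][0-9]/) ? substr($1, 1, 3) : $1; cnt[key]++ }
        END {
            for (key in cnt) {
                flag = (key ~ /^[0-9][0-9][0-9]/) ? 0 : 1
                print flag, key, cnt[key]
            }
        }
    ' "$@" |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1n -k3,3nr -k2,2 |
    cut -f2-
}

# Example: the first sample from the question, rebuilt with real tabs.
printf '%s\t%s\n' 687/878 9 890987 4 01a 55 1b 8743917 890a 34 abcdee 987 \
    dfeqfe fkdjald 890897 34213 6878853 834 32fasd 53891 abcdee 8794371 abd 873 |
    grouped_counts
```

For that sample this prints 890 with a count of 3 and 687 with 2, followed by abcdee with 2 and the remaining keys with 1 each. Passing `-t` with a tab keeps keys that contain spaces in a single field.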

stack0114106 · Answer 2 · 2019-01-07T14:05:11.797

1

You can try Perl

$ cat nefijaka.txt
687     878     9
890987  4
890a    34
abcdee  987
$ perl -lne  ' /^(\d{3})|(\S+)/; $x=$1?$1:$2; $kv{$x}++; END { print "$_\t$kv{$_}" for (sort keys %kv) } ' nefijaka.txt
687     1
890     2
abcdee  1
$

You can pipe it to sort to get the values sorted by count:

$ perl -lne  ' /^(\d{3})|(\S+)/; $x=$1?$1:$2; $kv{$x}++; END { print "$_\t$kv{$_}" for (sort keys %kv) } ' nefijaka.txt | sort -k2 -nr
890     2
abcdee  1
687     1

EDIT1:

$ cat nefijaka.txt2
687     878     9
890987  4
890a    34
abcdee  987
a word and then 23
$ perl -lne  ' /^(\d{3})|(.+?\t)/; $x=$1?$1:$2; $x=~s/\t//g; $kv{$x}++; END { print "$_\t$kv{$_}" for (sort keys %kv) } ' nefijaka.txt2
687     1
890     2
a word and then 1
abcdee  1
$
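
Once keys can contain spaces, as in the output above, the earlier `sort -k2 -nr` no longer targets the count: by default `sort` splits fields at blanks, so field 2 of `a word and then<TAB>1` begins at ` word`. A minimal sketch that tells `sort` to split on tabs only is given here; the function name `sort_by_count` and the tie-break on the key are assumptions for illustration.

```shell
# sort_by_count: hypothetical helper name, not from the answer.
# Sorts "key<TAB>count" lines by count (highest first), then by key,
# treating only the tab character as a field separator.
sort_by_count() {
    LC_ALL=C sort -t "$(printf '\t')" -k2,2nr -k1,1
}

# Example using the EDIT1 output, including the key with spaces.
printf 'a word and then\t1\n890\t2\n687\t1\nabcdee\t1\n' | sort_by_count
```

With this input, 890 comes first with a count of 2, and the three keys with a count of 1 follow in byte order.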

edited Jan 07 '19 at 14:05

answered Jan 07 '19 at 13:33


just pipe `| sort -k2 -nr`. Check my updated answer – stack0114106 Jan 07 '19 at 13:47
the above looks good, if only i could sort results not by keys, but by occurrences found, and if it could take into account that i also have a space in certain word values. ie. aa bb cc – Jan 07 '19 at 13:53
you mean "aa bb cc" like "abcdee" in the sample input? – stack0114106 Jan 07 '19 at 13:54
'abcdee' one value, and 'one sentence with spaces in it' a second value in the input. or the sentence could have any characters – Jan 07 '19 at 13:58
the OP doesn't mention that there will be 2 groups.. there is some typo in the first line..the \ is \t – stack0114106 Jan 07 '19 at 14:42
@stack0114106 Two passages from OP's description, starting from `so i need to` -- Near the end. – Til Jan 07 '19 at 15:01
yes, two passages, two groups, but i accepted the answer since i would like to try refine the solution myself before getting a ready answer :) – Jan 07 '19 at 15:15
@nefijaka .. appreciate that you wanted try on yourself – stack0114106 Jan 07 '19 at 15:21
@nefijaka.. please provide sample input for the 2 group scenario.. I'll update the answer to that as well – stack0114106 Jan 07 '19 at 15:26
@nefijaka.. could you please add the expected output?.. I tried Ed's solution.. it's all printing 1,1,...so I'm a little confused. – stack0114106 Jan 07 '19 at 15:37
