cat scorecard.csv | cut -d , -f6 | sort -n | uniq -c

gives me counts without repeats, while

cat scorecard.csv | cut -d , -f6 | uniq -c | sort -n

gives me counts, but there are repeats and the counts are not accurate. Why is this so, when the two pipelines are so similar?

Here is some output for sort first, then uniq:

  9 AK
 94 AL
 89 AR
  1 AS
122 AZ
714 CA
113 CO
 81 CT
 24 DC
 20 DE
409 FL
  1 FM
174 GA
  3 GU
 24 HI
 88 IA
 36 ID
275 IL
151 IN
 84 KS
100 KY
130 LA
178 MA
 91 MD
 40 ME
  1 MH
194 MI
124 MN
179 MO
  1 MP
 61 MS
 33 MT
187 NC
 29 ND
 49 NE
 40 NH
160 NJ
 48 NM
 41 NV
449 NY
313 OH
127 OK
 86 OR
377 PA
137 PR
  1 PW
 24 RI
108 SC
 30 SD
  1 STABBR
176 TN
443 TX
 75 UT
177 VA
  2 VI
 26 VT
117 WA
109 WI
 73 WV
 10 WY

Here is some output for uniq first, then sort:

  3 CA
  3 CA
  3 CA
  3 CA
  3 CO
  3 CO
  3 CO
  3 CT
  3 CT
  3 CT
  3 FL
  3 IL
  3 IL
  3 IL
  3 IL
  3 IL
  3 KY
  3 MA
  3 MA
  3 MI
  3 MI
  3 MI
  3 MO
  3 MO
  3 MO
  3 MO
  3 NC
  3 NJ
  3 NJ
  3 NJ
  3 NY
  3 NY
  3 NY
  3 NY
  3 OH
  3 OH
  3 OH
  3 OH
  3 OH
  3 PA
  3 PA
  3 PA
  3 PR
  3 SC
  3 TN
  3 TN
  3 TX
  3 TX
  3 TX
  3 TX
  3 TX
  3 TX
  3 TX
  3 TX
  3 TX
  3 TX
  3 UT
  3 UT
  3 VA
  3 VA
  3 WA
  3 WA
  3 WA
  3 WI
  3 WI
  3 WV
  4 AZ
  4 CA
  4 CA
  4 CA
  4 CA
  4 FL
  4 IL
  4 IN
  4 KS
  4 MA
  4 MD
  4 MI
  4 MS
  4 NY
  4 NY
  4 PR
  4 TX
  4 TX
  4 TX
  4 UT
  4 WI
  5 AL
  5 AR
  5 CA
  5 CO
  5 FL
  5 FL
  5 FL
  5 MO
  5 NY
  5 OK
  5 PA
  5 PR
  5 TX
  6 AK
  6 CA
  6 CT
  6 FL
  6 IL
  6 NC
  6 OH
  6 OK
  6 PA
  6 PR
  6 TX
  6 TX
  6 VA
  7 FL
  7 IL
  7 NY
  7 OH
  7 TX
  7 TX
  7 TX
  8 CA
  8 CA
  8 CA
  8 FL
  8 FL
  8 GA
  8 OH
  8 PA
  9 CA
  9 CA
  9 DE
  9 FL
  9 FL
  9 IN
  9 MO
 10 OK
 10 VA
 10 WY
 11 MO
 11 NV
 12 AZ
 12 DC
 14 CA
 14 CA
 14 HI
 14 NY
 14 PA
 14 RI
 15 ID
 15 MN
 16 MO
 19 IN
 21 VT
 22 CA
 22 FL
 22 MI
 23 UT
 24 CA
 24 IN
 24 MT
 25 ND
 25 OH
 26 IA
 27 SD
 29 KS
 29 ME
 30 KS
 31 NH
 32 NM
 37 NE
 38 AZ
 39 MS
 42 CT
 43 WV
 45 OH
 49 IN
 50 IA
 56 OK
 58 CO
 59 AL
 59 MD
 61 AR
 61 PR
 62 OR
 62 SC
 63 PA
 63 WI
 64 LA
 65 KY
 65 WA
 66 FL
 67 FL
 72 MO
 81 NJ
 82 GA
 85 MN
 90 VA
100 TN
106 MI
123 OH
125 MA
125 NC
169 IL
184 PA
185 TX
288 NY
301 CA
wjandrea
  • Related: [Find duplicate lines in a file and count how many time each line was duplicated?](https://stackoverflow.com/q/6712437/4518341) – wjandrea Jul 14 '19 at 03:04
  • The big difference is `uniq` will `"Filter adjacent matching lines from INPUT"` (if they are not sorted first -- you see the problem....) – David C. Rankin Jul 14 '19 at 06:29
  • If you look at the man page for uniq – `man uniq` – you'll see this (or close to it, depending on which OS you're using): `uniq -- report or filter out repeated lines in a file`. The important point is "repeated lines"; `uniq` by itself does not re-arrange the contents. – Kaan Jul 15 '19 at 20:09

2 Answers


Adding to what @wjandrea said, `sort -n` sorts numerically rather than alphabetically, so `sort -n | uniq -c` is meaningless: the input to `sort -n` doesn't contain any numbers yet, since the counts are only added by `uniq -c` afterwards.

I suspect what you want is

cat scorecard.csv | cut -d , -f6 | sort | uniq -c | sort -n
root
  • `sort -n` worked for OP though. It seems to sort numerically first, then alphabetically. BTW, nice analysis :) That pipeline is bang-on. – wjandrea Jul 14 '19 at 03:09
  • 1
    @wjandrea thanks! `sort -n` worked because `-n` just tells sort that the string " 3" is smaller than "300", if no numbers are there then it has no effect. – root Jul 14 '19 at 03:12
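That tie-breaking behavior is easy to see on a tiny input (a minimal sketch using printf in place of the CSV column; the exact fallback is GNU sort's "last-resort" whole-line comparison, so other implementations may differ):

```shell
# None of these lines begin with a number, so under -n they all
# compare numerically equal; GNU sort then falls back to a
# whole-line byte comparison, which is effectively alphabetical.
printf 'TX\nCA\nNY\n' | sort -n
# CA
# NY
# TX
```

This is why `sort -n` still grouped the state codes for the OP even though no numeric keys were present.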

You have some non-adjacent duplicate lines in the input.

From man uniq:

Filter adjacent matching lines ...

With no options, matching lines are merged to the first occurrence.

...

Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use 'sort -u' without 'uniq'.

Also info uniq:

By default, uniq prints its input lines, except that it discards all but the first of adjacent repeated lines, so that no output lines are repeated. Optionally, it can instead discard lines that are not repeated, or all repeated lines.

The input need not be sorted, but repeated input lines are detected only if they are adjacent. If you want to discard non-adjacent duplicate lines, perhaps you want to use sort -u.
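The adjacency rule is easy to reproduce on a tiny input (a minimal sketch using printf to stand in for the CSV column):

```shell
# Non-adjacent duplicates survive uniq -c: each run is counted
# separately, which is exactly where the repeated, undercounted
# lines in the question come from.
printf 'CA\nTX\nCA\n' | uniq -c
# prints three lines: 1 CA, 1 TX, 1 CA (count padding varies)

# Sorting first makes the duplicates adjacent, so they merge:
printf 'CA\nTX\nCA\n' | sort | uniq -c
# prints two lines: 2 CA, 1 TX

# And if you only need the distinct values, not the counts,
# sort -u does the deduplication in one step:
printf 'CA\nTX\nCA\n' | sort -u
# CA
# TX
```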

wjandrea