
Suppose I have a file similar to the following:

123 
123 
234 
234 
123 
345

I would like to find how many times '123' was duplicated, how many times '234' was duplicated, etc. So ideally, the output would be like:

123  3 
234  2 
345  1

7 Answers


Assuming there is one number per line:

sort <file> | uniq -c

With the GNU version (e.g., on Linux), you can also use the more verbose --count flag:

sort <file> | uniq --count
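
For the sample file in the question, either form should print each distinct line prefixed by its count, roughly like this:

  3 123
  2 234
  1 345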

  • This is what I do; however, algorithmically this doesn't seem to be the most efficient approach (O(n log n) * avg_line_len, where n is the number of lines). I'm working on files that are several gigabytes large, so performance is a key issue. I wonder whether there is a tool that does just the counting in a single pass using a prefix tree (in my case strings often have common prefixes) or similar; that should do the trick in O(n) * avg_line_len. Does anyone know such a command-line tool? – Droggl Nov 20 '13 at 17:27
  • An additional step is to pipe the output of that into a final 'sort -n' command. That will sort the results by which lines occur most often. – samoz Jun 11 '14 at 18:24
  • If you want to print only duplicate lines, use 'uniq -d'. – DmitrySandalov Sep 03 '14 at 01:20
  • If you want to sort the result again, you can use `sort` once more, like: `sort | uniq -c | sort -n` – Abhishek Kashyap Jan 09 '19 at 09:54
  • If @DmitrySandalov had not mentioned `-d`, I would have used `… | uniq -c | grep -v '^\s*1'` (`-v` inverts the match, i.e. it drops matching lines; not verbose, not version :)) – Frank N Sep 06 '21 at 12:37
  • By the way, you can use `-c -d` and print duplicate lines *and* the count. – Andrey Regentov Oct 08 '21 at 10:45
  • Is there a way to ignore the first part of a text file? My logs have timestamps, so I want to ignore the first n chars. – Brunis Mar 03 '23 at 12:17
  • @Brunis https://stackoverflow.com/a/339941/7030591 – SADIK KUZU Mar 03 '23 at 21:46

This will print duplicate lines only, with counts:

sort FILE | uniq -cd

or, with GNU long options (on Linux):

sort FILE | uniq --count --repeated

On BSD and OSX, you have to use grep to filter out unique lines:

sort FILE | uniq -c | grep -v '^ *1 '

For the given example, the result would be:

  3 123
  2 234

If you want to print counts for all lines including those that appear only once:

sort FILE | uniq -c

or, with GNU long options (on Linux):

sort FILE | uniq --count

For the given input, the output is:

  3 123
  2 234
  1 345

In order to sort the output with the most frequent lines on top, you can do the following (to get all results):

sort FILE | uniq -c | sort -nr

or, to get only duplicate lines, most frequent first:

sort FILE | uniq -cd | sort -nr

On OSX and BSD, the final one becomes:

sort FILE | uniq -c | grep -v '^ *1 ' | sort -nr
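
As the comments below suggest, piping the counts through awk instead of grep makes the minimum count easy to adjust; a minimal sketch (the threshold of 1 is only an example), which also works on OSX:

# keep lines whose count (the first field added by uniq -c) is greater than 1,
# then sort the most frequent first
sort FILE | uniq -c | awk '$1 > 1' | sort -nr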

  • Good point with the --repeated or -d option. So much more accurate than using "|grep 2" or similar! – Lauri Oct 22 '13 at 10:42
  • How can I modify this command to retrieve all lines whose repetition count is more than 100? – Black_Rider Nov 27 '13 at 07:57
  • @Black_Rider Adding `| sort -n` or `| sort -nr` to the pipe will sort the output by repetition count (ascending or descending respectively). This is not what you're asking, but I thought it might help. – Andrea Nov 27 '13 at 08:21
  • @Black_Rider awk seems able to do all kinds of calculations: in your case you could do `| awk '$1>100'` – Andrea Nov 27 '13 at 11:07
  • @fionbio `sort FILE | uniq -cd` should work on OSX too – Andrea Apr 01 '15 at 18:07
  • @fionbio Looks like [you can't use -c and -d together on OSX uniq](https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man1/uniq.1.html). Thanks for pointing that out. You can [use grep to filter out unique lines](http://stackoverflow.com/a/5699355/2093341): `sort FILE | uniq -c | grep -v '^ *1 '` – Andrea Apr 02 '15 at 10:16
  • Piping to `awk '$1>1'` seems a lot better than `grep -v '^ *1 '` to me. It allows us to change the minimum duplicate count with ease and works flawlessly even on macOS. :) – Paulo Freitas Jun 08 '17 at 08:51
  • `sort FILE | uniq -c | grep -v '^ *1 ' | sort -nr` is beautiful! – Gabriel Borges Oliveira Jul 10 '19 at 13:58
  • On OSX you can use `uniq -d` if you don't care about the count. – Ji Fang Nov 23 '21 at 23:48

To find and count duplicate lines in multiple files, you can try the following command:

sort <files> | uniq -c | sort -nr

or:

cat <files> | sort | uniq -c | sort -nr
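
For example, with two hypothetical log files access1.log and access2.log (the names are placeholders), <files> can be an explicit list of paths or a glob:

# count duplicate lines across both files, most frequent first
sort access1.log access2.log | uniq -c | sort -nr

# the same, using a glob
sort access*.log | uniq -c | sort -nr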

Via awk:

awk '{dups[$1]++} END{for (num in dups) {print num,dups[num]}}' data

In the awk command, `dups[$1]++` uses the variable `$1`, which holds the entire contents of column 1, and the square brackets are array access. So, for the first column of each line in the data file, the corresponding element of the array named dups is incremented.

At the end, we loop over the dups array with num as the variable, printing the saved numbers first and then their duplicate counts from dups[num].

Note that your input file has trailing spaces at the end of some lines; if you clean those up, you can use $0 in place of $1 in the command above :)
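
For example, a minimal sketch (assuming trailing spaces or tabs are the only clean-up needed) that strips them and then counts on the whole line:

# strip trailing whitespace from each line, then count occurrences of the full line
awk '{ sub(/[ \t]+$/, ""); dups[$0]++ } END { for (line in dups) print line, dups[line] }' data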

  • Isn't this a bit of overkill considering that we have `uniq`? – Nathan Fellman Jun 06 '16 at 07:08
  • `sort | uniq` and the awk solution have quite different performance and resource trade-offs: if the files are large and the number of different lines is small, the awk solution is a lot more efficient. It is linear in the number of lines, and the space usage is linear in the number of different lines. OTOH, the awk solution needs to keep all the different lines in memory, while (GNU) sort can resort to temp files. – Lars Noschinski Apr 21 '17 at 05:55

On Windows, using Windows PowerShell, I used the command below to achieve this:

Get-Content .\file.txt | Group-Object | Select Name, Count

We can also use the Where-Object cmdlet to filter the result:

Get-Content .\file.txt | Group-Object | Where-Object { $_.Count -gt 1 } | Select Name, Count

  • Can you delete all occurrences of the duplicates except the last one... without changing the sort order of the file? – jparram Jun 30 '17 at 14:53

To print each line followed by its duplicate count (the value first, then the count, matching the desired output in the question), use this command:

sort filename | uniq -c | awk '{print $2, $1}'
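
For the sample file in the question, this should produce output in exactly the shape that was asked for:

123 3
234 2
345 1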

Assuming you've got access to a standard Unix shell and/or Cygwin environment:

tr -s ' ' '\n' < yourfile | sort | uniq -d -c
       ^--space char

Basically: convert all space characters to line breaks, then sort the translated output and feed that to uniq to count the duplicate lines.
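
For the sample file in the question (one number per line, some with trailing spaces), this should print something like the following with GNU uniq (note that BSD/OSX uniq may not accept -c and -d together, as mentioned in earlier comments):

  3 123
  2 234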

  • I guess this solution was tailored to a specific case of your own? i.e. you've got a list of words separated by spaces or newlines only. If it's only a list of numbers separated by newlines (no spaces) it will work fine there, but obviously your solution will treat lines containing spaces differently. – mwfearnley May 18 '21 at 12:29