Basic grep/sed/awk script to find duplicates

Question

I'm starting out with regular expressions and grep and I want to find out how to do this. I have this list:

1. 12493 6530
2. 12475 5462
3. 12441 5450
4. 12413 5258
5. 12478 4454
6. 12416 3859
7. 12480 3761
8. 12390 3746
9. 12487 3741
10. 12476 3557
...

And I want to get the contents of the middle column only (so NF==2 in awk?). The delimiter here is a space.

I then want to find which numbers appear more than once (duplicates). How would I go about doing that? Thank you, I'm a beginner.

This is more of a programming task than a regex exercise, as regex won't help you at all here. — Qtax, Nov 11 '14 at 22:22
What would your expected output be given that input file? If the answer is "nothing" then edit your question to provide an input file that WOULD produce output from the tool you want and the associated expected output. — Ed Morton, Nov 11 '14 at 23:04

score 4 · Answer 1 · edited May 23 '17 at 10:26

Using awk:

awk '{count[$2]++}END{for (a in count) {if (count[a] > 1 ) {print a}}}' file

Note that your sample has no duplicate numbers in the 2nd column, so this prints nothing for it.

$2 is the second column in awk.
count[$2]++ increments an array entry keyed by the number on the current line.
The END block runs after the last line has been read; it tests each array entry and prints the keys whose count is greater than 1.

More concisely (credit to jthill):

awk '++count[$2]==2{print $2}' file

answered Nov 11 '14 at 22:20

Gilles Quénot


`awk '++count[$2]==2 { print $2 }'` – jthill Nov 12 '14 at 03:25
Yeah, Lovely jthill ! – Gilles Quénot Nov 12 '14 at 07:54
@sputnick You should use `==2`, since if there are `3` or more equal, your solution would print the same value more than one time. – Jotne Nov 12 '14 at 07:59

score 2 · Answer 2 · answered Nov 11 '14 at 22:32

Using perl:

perl -anE '$h{$F[1]}++; END{ say for grep $h{$_} > 1, keys %h }'

Iterate over the lines (-n), splitting each into fields (-a), and build a hash (%h/$h{...}) holding the count (++) of each second-column value ($F[1]). After the last line (END{ ... }), say every hash key whose count ($h{$_}) is > 1.

vincentleest · Answer 3 · 2014-11-12T02:47:44.977

-1

With the data stored in a file named test, using a combination of awk, sort, uniq and sed:

 cat test | awk -v x=2 '{print $x}' | sort | uniq -c | sed  '/^ *1 /d' | awk -v x=2 '{print $x}'

Explanation:

awk -v x=2 '{print $x}'

selects the 2nd column

sort

groups equal numbers together so that uniq can count them

uniq -c

counts the occurrences of each number

sed  '/^ *1 /d'

deletes the entries that occur only once (uniq -c may pad the count with leading spaces, hence the " *")

awk -v x=2 '{print $x}'

drops the count, leaving only the number

edited Nov 12 '14 at 02:47

answered Nov 11 '14 at 22:35

vincentleest


`grep -v "1 "` will remove any lines that appear 11 times. – Ed Morton Nov 12 '14 at 00:32
@glennjackman , I assumed the op wanted to keep the orders of the entries, so i didn't sort it, but i added the sort into the answer as well. – vincentleest Nov 12 '14 at 02:49
`uniq -c` only counts *consecutive* duplicates. If the input is unsorted, the results may not be so useful. – glenn jackman Nov 12 '14 at 03:08
