
I found this answer (Find duplicate lines in a file and count how many times each line was duplicated?) while searching, and it solves the issue of exact duplicate lines, but I have a peculiar issue.

I need to find lines that are duplicates when only the beginning of each line is considered.

For example:

2501,3,0,1,0,1457695800
2501,3,0,1,0,1457789340
2502,3,0,0,0,1457695800
2502,3,0,0,0,1457789340
2503,3,0,0,0,1457789340
2504,3,0,0,0,1457789340 
2505,3,0,0,0,1457789340

In the CSV data above, the 2501 and 2502 lines would be duplicates if the timestamp (the last field) were not there.

Is there a way to find them as duplicates by considering only the first 5 fields, i.e. excluding the timestamp?


1 Answer


I ended up finding the answer by chaining a few commands together:

cat my_file.csv | perl -pe 's/^(.*),[0-9]{10}\s*$/$1/' | sort | uniq -d
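
Run against the sample data above, this prints each duplicated prefix once:

2501,3,0,1,0
2502,3,0,0,0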

So basically, the steps are:

  1. use cat to read the contents of the file
  2. pipe it to perl, where a regular expression substitution keeps only the capturing group (in this case, everything before the trailing timestamp)
  3. pipe the output to sort, which puts identical lines next to each other (uniq only detects adjacent duplicates)
  4. use uniq with the -d switch to print only the duplicated lines
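
As an aside, if the timestamp is always the sixth (and last) field, you can get the same result without a regular expression by using cut to keep only the first five fields:

cut -d, -f1-5 my_file.csv | sort | uniq -d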

If you like, you can also redirect the result to a file:

cat my_file.csv | perl -pe 's/^(.*),[0-9]{10}\s*$/$1/' | sort | uniq -d > line_duplicates.txt
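
And if you also want to know how many times each prefix occurs (as in the linked answer), a small awk sketch can count them in one pass; this assumes the same comma-separated layout with the timestamp as the last field:

awk -F, '{ count[$1","$2","$3","$4","$5]++ } END { for (k in count) if (count[k] > 1) print count[k], k }' my_file.csv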

Hope this helps.
