
I found this answer (Find duplicate lines in a file and count how many times each line was duplicated?) while searching, and it solves the issue of exact duplicate lines, but I have a peculiar issue.

I need to find lines that are duplicates when only the beginning of each line is considered.

For example:

2501,3,0,1,0,1457695800
2501,3,0,1,0,1457789340
2502,3,0,0,0,1457695800
2502,3,0,0,0,1457789340
2503,3,0,0,0,1457789340
2504,3,0,0,0,1457789340 
2505,3,0,0,0,1457789340

In the CSV data above, the 2501 and 2502 lines would be duplicates if the timestamp (the last field) were not there.

Is there a way to find them as duplicates by considering only the first 5 fields, i.e. excluding the timestamp?


1 Answer


I ended up finding the answer by chaining a few commands together:

cat my_file.csv | perl -pe 's/^(.*),[0-9]{10}\s*$/$1/' | sort | uniq -d
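
Run against the sample data above, this prints each duplicated prefix once:

2501,3,0,1,0
2502,3,0,0,0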

So basically, the steps are:

  1. use cat to read the contents of the file
  2. pipe it to perl, where a regular expression substitution keeps only the capturing group (in this case, everything before the trailing timestamp)
  3. pipe the output to sort, which puts identical lines next to each other (uniq only detects adjacent duplicates)
  4. use uniq with the -d switch to print only the duplicated lines
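
As an aside, if the timestamp is always the sixth (and last) field, you can get the same result without a regular expression by using cut to keep only the first five fields:

cut -d, -f1-5 my_file.csv | sort | uniq -d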

If you like, you can also redirect the result to a file:

cat my_file.csv | perl -pe 's/^(.*),[0-9]{10}\s*$/$1/' | sort | uniq -d > line_duplicates.txt
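
And if you also want to know how many times each prefix occurs (as in the linked answer), a small awk sketch can count them in one pass; this assumes the same comma-separated layout with the timestamp as the last field:

awk -F, '{ count[$1","$2","$3","$4","$5]++ } END { for (k in count) if (count[k] > 1) print count[k], k }' my_file.csv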

Hope this helps.
