
I have a 40 MB CSV file with 50,000 records. It's a giant product listing. Each row has close to 20 fields [Item#, UPC, Desc, etc.].

How can I:

a) Find and print duplicate rows. [This file has been built by appending, so it contains multiple header rows, which I need to remove; that's why I want to know the exact rows that are duplicated first.]

b) Find and print duplicate rows based on a column. [To see if a UPC is assigned to multiple products.]

I need to run the command or script on the server, where I have Perl and Python installed. A bash script or command would work for me too.

I don't need to preserve the order of the rows.

I tried:

sort largefile.csv | uniq -d

to get the duplicates, but I am not getting the expected answer.

Ideally I would like a bash script or command, but if anyone has any other suggestion, that would be great too.

Thanks


See: Remove duplicate rows from a large file in Python over on Stack Overflow

    why isn't `uniq -d` giving you the expected answer? – dogbane Nov 04 '10 at 11:01
  • You should show some sample data (sanitized if necessary), including what you consider to be duplicates and non-duplicates. Specifically, if the first field (item#) is different, but the second field (UPC) is the same or if the whole record is the same, etc. You should also be more specific about what "I am not getting the expected answer" means. `uniq` only looks at the whole line unless you tell it to skip part of the line (but it doesn't use commas as field separators). Also, CSV files can be difficult to deal with unless you are using tools made for the purpose (e.g. a Python lib). – Dennis Williamson Nov 04 '10 at 14:26

5 Answers


Try the following:

# Sort before using the uniq command
sort largefile.csv | uniq -d

uniq is a very basic command and only reports duplicates that appear on adjacent lines, which is why the input has to be sorted first.
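The same idea extends to part (b) of the question by comparing only the UPC column. A minimal sketch, assuming UPC is the second comma-separated field and that no field contains embedded commas or quotes (both of those are assumptions about your data):

# List the UPC values that occur on more than one line
# (assumes UPC is field 2 and that fields contain no embedded commas)
cut -d, -f2 largefile.csv | sort | uniq -d

# Print the full rows whose UPC has already appeared earlier in the file
awk -F, 'seen[$2]++' largefile.csv

The awk line prints only the second and later occurrences of each repeated UPC; to print the first occurrence as well you would need a second pass over the file.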

Morten

Find and print duplicate rows in Perl:

perl -ne 'print if $SEEN{$_}++' < input-file

Find and print rows with a duplicate value in a given column in Perl -- say the 5th column, where fields are separated by commas:

perl -F/,/ -ane 'print if $SEEN{$F[4]}++' < input-file
mob

You could use the SQLite shell to import your CSV file and create indexes so that SQL queries run faster.
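A minimal sketch of that approach, assuming the CSV has a header row and that the UPC column in it is literally named UPC (the table and column names below are guesses based on the question, not taken from the actual file):

# Load the CSV into a throwaway SQLite database, then ask which UPC
# values (column name assumed from the question) appear more than once.
sqlite3 /tmp/products.db <<'SQL'
.mode csv
.import largefile.csv products
-- UPC values assigned to more than one row
SELECT UPC, COUNT(*) AS occurrences
FROM products
GROUP BY UPC
HAVING COUNT(*) > 1;
SQL

When the target table does not exist yet, the sqlite3 shell creates it and takes the column names from the first line of the CSV. Once the data is loaded you can keep querying it, and an index (CREATE INDEX idx_upc ON products(UPC);) helps for repeated lookups, although a single GROUP BY over 50,000 rows is quick even without one.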

Benoit

Here is my (very simple) script to do it with Ruby and the Rake gem.

First create a Rakefile and write this code:

namespace :csv do
  desc "find duplicates from CSV file on given column"
  task :double, [:file, :column] do |t, args|
    args.with_defaults(column: 0)
    values = []
    index  = args.column.to_i
    # parse the given file line by line
    File.foreach(args.file) do |line|
      # collect the value of the given column (';'-separated fields)
      values << line.split(';')[index]
    end
    # compare length with & without uniq method 
    puts values.uniq.length == values.length ? "File does not contain duplicates" : "File contains duplicates"
  end
end

Then, to use it on the first column:

$ rake csv:double["2017.04.07-Export.csv"] 
File does not contain duplicates

And to use it on the second column (for example):

$ rake csv:double["2017.04.07-Export.csv",1] 
File contains duplicates
alexandre-rousseau

For the second part: read the file with Text::CSV into a hash keyed on your unique key(s), and check whether that key already exists in the hash before adding the row. Something like this:

data (which doesn't need to be sorted); in this example we need the first two columns to be unique:

1142,X426,Name1,Thing1
1142,X426,Name2,Thing2
1142,X426,Name3,Thing3
1142,X426,Name4,Thing4
1144,X427,Name5,Thing5
1144,X427,Name6,Thing6
1144,X427,Name7,Thing7
1144,X427,Name8,Thing8

code:

use strict;
use warnings;
use Text::CSV;

my %data;
my %dupes;
my @rows;
my $csv = Text::CSV->new ()
                        or die "Cannot use CSV: ".Text::CSV->error_diag ();

open my $fh, "<", "data.csv" or die "data.csv: $!";
while ( my $row = $csv->getline( $fh ) ) {
    # insert row into row list  
    push @rows, $row;
    # join the unique keys with the
    # perl 'multidimensional array emulation' 
    # subscript  character
    my $key = join( $;, @{$row}[0,1] ); 
    # if it was just one field, just use
    # my $key = $row->[$keyfieldindex];
    # if you were checking for full line duplicates (header lines):
    # my $key = join($;, @$row);
    # if %data has an entry for the record, add it to dupes
    if (exists $data{$key}) { # duplicate 
        # if it isn't already duplicated
        # add this row and the original 
        if (not exists $dupes{$key}) {
            push @{$dupes{$key}}, $data{$key};
        }
        # add the duplicate row
        push @{$dupes{$key}}, $row;
    } else {
        $data{ $key } = $row;
    }
}

$csv->eof or $csv->error_diag();
close $fh;
# print out duplicates:
warn "Duplicate Values:\n";
warn "-----------------\n";
foreach my $key (keys %dupes) {
    my @keys = split($;, $key);
    warn "Key: @keys\n";
    foreach my $dupe (@{$dupes{$key}}) {
        warn "\tData: @$dupe\n";
    }
}

Which prints out something like this:

Duplicate Values:
-----------------
Key: 1142 X426
    Data: 1142 X426 Name1 Thing1
    Data: 1142 X426 Name2 Thing2
    Data: 1142 X426 Name3 Thing3
    Data: 1142 X426 Name4 Thing4
Key: 1144 X427
    Data: 1144 X427 Name5 Thing5
    Data: 1144 X427 Name6 Thing6
    Data: 1144 X427 Name7 Thing7
    Data: 1144 X427 Name8 Thing8
MkV