
I have a 40 MB CSV file with 50,000 records. It's a giant product listing. Each row has close to 20 fields [Item#, UPC, Desc, etc.].

How can I:

a) Find and print duplicate rows. [This file has been built by appending, so it contains multiple header rows, which I need to remove; that's why I want to know the exact rows that are duplicated first.]

b) Find and print duplicate rows based on a column. [To see if a UPC is assigned to multiple products.]

I need to run the command or script on the server, where I have Perl and Python installed. A bash script or command would work for me too.

I don't need to preserve the order of the rows.

I tried:

sort largefile.csv | uniq -d

to get the duplicates, but I am not getting the expected answer.

Ideally I would like a bash script or command, but if anyone has any other suggestion, that would be great too.

Thanks


See: Remove duplicate rows from a large file in Python over on Stack Overflow

    why isn't `uniq -d` giving you the expected answer? – dogbane Nov 04 '10 at 11:01
  • You should show some sample data (sanitized if necessary), including what you consider to be duplicates and non-duplicates. Specifically, if the first field (item#) is different, but the second field (UPC) is the same or if the whole record is the same, etc. You should also be more specific about what "I am not getting the expected answer" means. `uniq` only looks at the whole line unless you tell it to skip part of the line (but it doesn't use commas as field separators). Also, CSV files can be difficult to deal with unless you are using tools made for the purpose (e.g. a Python lib). – Dennis Williamson Nov 04 '10 at 14:26

5 Answers


Try the following:

# Sort before using the uniq command
sort largefile.csv | uniq -d

uniq is a very basic command and only reports duplicates that appear on adjacent lines, which is why the input has to be sorted first.
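The same idea extends to part (b) of the question by comparing only the UPC column. A minimal sketch, assuming UPC is the second comma-separated field and that no field contains embedded commas or quotes (both of those are assumptions about your data):

# List the UPC values that occur on more than one line
# (assumes UPC is field 2 and that fields contain no embedded commas)
cut -d, -f2 largefile.csv | sort | uniq -d

# Print the full rows whose UPC has already appeared earlier in the file
awk -F, 'seen[$2]++' largefile.csv

The awk line prints only the second and later occurrences of each repeated UPC; to print the first occurrence as well you would need a second pass over the file.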

Morten

Find and print duplicate rows in Perl:

perl -ne 'print if $SEEN{$_}++' < input-file

Find and print rows with a duplicate value in a given column in Perl -- say the 5th column, where fields are separated by commas:

perl -F/,/ -ane 'print if $SEEN{$F[4]}++' < input-file
mob

You could use the SQLite shell to import your CSV file and create indexes so that SQL queries run faster.
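A minimal sketch of that approach, assuming the CSV has a header row and that the UPC column in it is literally named UPC (the table and column names below are guesses based on the question, not taken from the actual file):

# Load the CSV into a throwaway SQLite database, then ask which UPC
# values (column name assumed from the question) appear more than once.
sqlite3 /tmp/products.db <<'SQL'
.mode csv
.import largefile.csv products
-- UPC values assigned to more than one row
SELECT UPC, COUNT(*) AS occurrences
FROM products
GROUP BY UPC
HAVING COUNT(*) > 1;
SQL

When the target table does not exist yet, the sqlite3 shell creates it and takes the column names from the first line of the CSV. Once the data is loaded you can keep querying it, and an index (CREATE INDEX idx_upc ON products(UPC);) helps for repeated lookups, although a single GROUP BY over 50,000 rows is quick even without one.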

Benoit

Here is my (very simple) script to do it with Ruby and the Rake gem.

First create a Rakefile and write this code:

namespace :csv do
  desc "find duplicates from CSV file on given column"
  task :double, [:file, :column] do |t, args|
    args.with_defaults(column: 0)
    values = []
    index  = args.column.to_i
    # parse the given file line by line
    File.foreach(args.file) do |line|
      # collect the value of the given column (';'-separated fields)
      values << line.split(';')[index]
    end
    # compare length with & without uniq method 
    puts values.uniq.length == values.length ? "File does not contain duplicates" : "File contains duplicates"
  end
end

Then, to use it on the first column:

$ rake csv:double["2017.04.07-Export.csv"] 
File does not contain duplicates

And to use it on the second column (for example):

$ rake csv:double["2017.04.07-Export.csv",1] 
File contains duplicates
alexandre-rousseau

For the second part: read the file with Text::CSV into a hash keyed on your unique key(s), and check whether that key already exists in the hash before adding the row. Something like this:

data (which doesn't need to be sorted); in this example we need the first two columns to be unique:

1142,X426,Name1,Thing1
1142,X426,Name2,Thing2
1142,X426,Name3,Thing3
1142,X426,Name4,Thing4
1144,X427,Name5,Thing5
1144,X427,Name6,Thing6
1144,X427,Name7,Thing7
1144,X427,Name8,Thing8

code:

use strict;
use warnings;
use Text::CSV;

my %data;
my %dupes;
my @rows;
my $csv = Text::CSV->new ()
                        or die "Cannot use CSV: ".Text::CSV->error_diag ();

open my $fh, "<", "data.csv" or die "data.csv: $!";
while ( my $row = $csv->getline( $fh ) ) {
    # insert row into row list  
    push @rows, $row;
    # join the unique keys with the
    # perl 'multidimensional array emulation' 
    # subscript  character
    my $key = join( $;, @{$row}[0,1] ); 
    # if it was just one field, just use
    # my $key = $row->[$keyfieldindex];
    # if you were checking for full line duplicates (header lines):
    # my $key = join($;, @$row);
    # if %data has an entry for the record, add it to dupes
    if (exists $data{$key}) { # duplicate 
        # if it isn't already duplicated
        # add this row and the original 
        if (not exists $dupes{$key}) {
            push @{$dupes{$key}}, $data{$key};
        }
        # add the duplicate row
        push @{$dupes{$key}}, $row;
    } else {
        $data{ $key } = $row;
    }
}

$csv->eof or $csv->error_diag();
close $fh;
# print out duplicates:
warn "Duplicate Values:\n";
warn "-----------------\n";
foreach my $key (keys %dupes) {
    my @keys = split($;, $key);
    warn "Key: @keys\n";
    foreach my $dupe (@{$dupes{$key}}) {
        warn "\tData: @$dupe\n";
    }
}

Which prints out something like this:

Duplicate Values:
-----------------
Key: 1142 X426
    Data: 1142 X426 Name1 Thing1
    Data: 1142 X426 Name2 Thing2
    Data: 1142 X426 Name3 Thing3
    Data: 1142 X426 Name4 Thing4
Key: 1144 X427
    Data: 1144 X427 Name5 Thing5
    Data: 1144 X427 Name6 Thing6
    Data: 1144 X427 Name7 Thing7
    Data: 1144 X427 Name8 Thing8
MkV