
I have this table in a MySQL database.

id  filename    firstname lastname  dept        salary
1   test1.csv   Jan     Thomas      Sales       5000
2   test1.csv   Jan     Michael     Sales       200
3   test1.csv   Thomas  John        Technology  12900
4   test2.csv   Robert  James       Technology  5500
5   test2.csv   Robert  Albertson   Technology  6000
6   test2.csv   Mark    Jeffries    Technology  900
7   test2.csv   Ted     James       Technology  10000
8   test2.csv   Mayla   Arthurs     Technology  7000
9   test2.csv   Mayla   Smith       Technology  9500
10  test3.csv   Mayla   Anthony     Technology  3000
11  test3.csv   Mayla   Mark        Technology  3000
12  test4.csv   Mayla   Roberts     Technology  8500
13  test4.csv   Anthony Anderson    Marketing   9500
14  test5.csv   Anthony Smith       Technology  6000
15  test5.csv   Jan     Thomas      Sales       5000
16  test5.csv   Jan     Michael     Sales       200
17  test5.csv   Thomas  John        Technology  12900
18  test1.csv   Jan     Michael     Sales       8000
19  test1.csv   Thomas  John        Technology  1540
20  test2.csv   Mayla   Smith       Technology  10500
21  test3.csv   Mayla   Anthony     Technology  5600
22  test4.csv   Anthony Anderson    Marketing   2500
23  test5.csv   Brian   Earl        HR          1200
24  test5.csv   John    Smith       HR_Sales    2000
25  test6.csv   Jan     Thomas      HR_Sales    12000
26  test6.csv   Jan     Michael     Education   1500
27  test7.csv   Thomas  John        HR_Sales    1000

The SQL code to create the table is at the end of this post. Each record consists of filename, firstname, lastname, dept, salary. Sometimes, the same record is present in multiple files - I cannot have these duplicate records.

As you can see: id = 15, 16, 17 are duplicates of id = 1, 2, 3 respectively.

I need to drop duplicates where the filename is different but the record is the same.

Additional Information:

  1. I cannot use DELETE FROM employee WHERE id IN (15, 16, 17) because I do not know which rows will be duplicated.
  2. The table is constantly updated by appending more *.csv files to it. If I add a unique index, I can no longer append *.csv files that contain duplicates of records already in the database. So I cannot use a unique index or GROUP BY().

Is there a way to drop the duplicated rows, without using the PK column?

SQL code to create the above table:

CREATE SCHEMA dupl_test;

USE dupl_test;

CREATE TABLE employee (
    id INT AUTO_INCREMENT PRIMARY KEY,
    filename VARCHAR(20),
    firstname VARCHAR(20),
    lastname VARCHAR(20),
    dept VARCHAR(10),
    salary INT
);

INSERT INTO employee VALUES
(1,'test1.csv','Jan','Thomas','Sales',5000),
(2,'test1.csv','Jan','Michael','Sales',200),
(3,'test1.csv','Thomas','John','Technology',12900),
(4,'test2.csv','Robert','James','Technology',5500),
(5,'test2.csv','Robert','Albertson','Technology',6000),
(6,'test2.csv','Mark','Jeffries','Technology',900),
(7,'test2.csv','Ted','James','Technology',10000),
(8,'test2.csv','Mayla','Arthurs','Technology',7000),
(9,'test2.csv','Mayla','Smith','Technology',9500),
(10,'test3.csv','Mayla','Anthony','Technology',3000),
(11,'test3.csv','Mayla','Mark','Technology',3000),
(12,'test4.csv','Mayla','Roberts','Technology',8500),
(13,'test4.csv','Anthony','Anderson','Marketing',9500),
(14,'test5.csv','Anthony','Smith','Technology',6000),
(15,'test5.csv','Jan','Thomas','Sales',5000),
(16,'test5.csv','Jan','Michael','Sales',200),
(17,'test5.csv','Thomas','John','Technology',12900),
(18,'test1.csv','Jan','Michael','Sales',8000),
(19,'test1.csv','Thomas','John','Technology',1540),
(20,'test2.csv','Mayla','Smith','Technology',10500),
(21,'test3.csv','Mayla','Anthony','Technology',5600),
(22,'test4.csv','Anthony','Anderson','Marketing',2500),
(23,'test5.csv','Brian','Earl','HR',1200),
(24,'test5.csv','John','Smith','HR_Sales',2000),
(25,'test6.csv','Jan','Thomas','HR_Sales',12000),
(26,'test6.csv','Jan','Michael','Education',1500),
(27,'test7.csv','Thomas','John','HR_Sales',1000);
edesz
  • Sounds like you need a separate table to hold a mapping between employees and files, then the same employee can be in multiple files. – Hunter McMillen Apr 09 '15 at 00:45
  • You can just put a unique constraint over all the fields except the id and filename. When you load your CSVs, you can simply tell it to ignore duplicate-row errors and continue. – pala_ Apr 09 '15 at 00:52

1 Answer


You can delete the duplicates using DELETE with a JOIN in MySQL. The subquery picks, for each (firstname, lastname, dept, salary) group, the copy from the first filename; the outer DELETE removes every row that does not match a kept copy:

delete e
    from employee e left join
         (select firstname, lastname, dept, salary,
                 min(filename) as filename  -- keep the copy from the alphabetically first file
          from employee
          group by firstname, lastname, dept, salary
         ) tokeep
         on e.firstname = tokeep.firstname and e.lastname = tokeep.lastname and
            e.dept = tokeep.dept and e.salary = tokeep.salary and
            e.filename = tokeep.filename
    where tokeep.filename is null;  -- unmatched rows are the duplicates to drop
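If you also want future loads to skip duplicates automatically (as pala_ suggests in the comments), one option is a unique index over the non-key, non-filename columns combined with INSERT IGNORE. This is a sketch, not a drop-in solution: the index name `uq_employee_record` and the file `test8.csv` are made up, and it assumes a duplicate is fully defined by (firstname, lastname, dept, salary):

```sql
-- Assumption: a record is a duplicate when firstname, lastname, dept and
-- salary all match, regardless of which file it came from.
ALTER TABLE employee
    ADD UNIQUE INDEX uq_employee_record (firstname, lastname, dept, salary);

-- With the index in place, IGNORE turns duplicate-key errors into warnings,
-- so appending a file that overlaps existing data no longer fails:
INSERT IGNORE INTO employee (filename, firstname, lastname, dept, salary)
VALUES ('test8.csv', 'Jan', 'Thomas', 'Sales', 5000);  -- silently skipped

-- LOAD DATA supports the same keyword:
-- LOAD DATA INFILE 'test8.csv' IGNORE INTO TABLE employee ...
```

Note that this only works after the existing duplicates have been removed with the query above; adding the unique index to a table that still contains duplicates will fail.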
Gordon Linoff