I have this table in a MySQL database.
1 test1.csv Jan Thomas Sales 5000
2 test1.csv Jan Michael Sales 200
3 test1.csv Thomas John Technology 12900
4 test2.csv Robert James Technology 5500
5 test2.csv Robert Albertson Technology 6000
6 test2.csv Mark Jeffries Technology 900
7 test2.csv Ted James Technology 10000
8 test2.csv Mayla Arthurs Technology 7000
9 test2.csv Mayla Smith Technology 9500
10 test3.csv Mayla Anthony Technology 3000
11 test3.csv Mayla Mark Technology 3000
12 test4.csv Mayla Roberts Technology 8500
13 test4.csv Anthony Anderson Marketing 9500
14 test5.csv Anthony Smith Technology 6000
15 test5.csv Jan Thomas Sales 5000
16 test5.csv Jan Michael Sales 200
17 test5.csv Thomas John Technology 12900
18 test1.csv Jan Michael Sales 8000
19 test1.csv Thomas John Technology 1540
20 test2.csv Mayla Smith Technology 10500
21 test3.csv Mayla Anthony Technology 5600
22 test4.csv Anthony Anderson Marketing 2500
23 test5.csv Brian Earl HR 1200
24 test5.csv John Smith HR_Sales 2000
25 test6.csv Jan Thomas HR_Sales 12000
26 test6.csv Jan Michael Education 1500
27 test7.csv Thomas John HR_Sales 1000
The SQL code to create the table is at the end of this post. Each record consists of filename, firstname, lastname, dept, salary. Sometimes, the same record is present in multiple files - I cannot have these duplicate records.
As you can see: id = 15, 16, 17 are duplicates of id = 1, 2, 3 respectively.
I need to drop duplicates where the filename is different but the record is the same.
Additional Information:
- I cannot use
DELETE FROM employee WHERE id IN (15, 16, 17)
because I do not know which rows will be duplicated. - The table is constantly being updated by appending more
*.csv
files to it. This means that, if I create a new index column, then I cannot append*.csv
files that contain duplicates of the records already in the database. Thus, I cannot use an index column orGROUP BY()
.
Is there a way to drop the duplicated rows, without using the PK column?
SQL code to create the above table:
CREATE SCHEMA dupl_test;
USE dupl_test;
create table employee (
id INT AUTO_INCREMENT PRIMARY KEY,
filename varchar(20),
firstname varchar(20),
lastname varchar(20),
dept varchar(10),
salary int(10)
);
insert into employee values(1,'test1.csv','Jan','Thomas','Sales',5000);
insert into employee values(2,'test1.csv','Jan','Michael','Sales',200);
insert into employee values(3,'test1.csv','Thomas','John','Technology',12900);
insert into employee values(4,'test2.csv','Robert','James','Technology',5500);
insert into employee values(5,'test2.csv','Robert','Albertson','Technology',6000);
insert into employee values(6,'test2.csv','Mark','Jeffries','Technology',900);
insert into employee values(7,'test2.csv','Ted','James','Technology',10000);
insert into employee values(8,'test2.csv','Mayla','Arthurs','Technology',7000);
insert into employee values(9,'test2.csv','Mayla','Smith','Technology',9500);
insert into employee values(10,'test3.csv','Mayla','Anthony','Technology',3000);
insert into employee values(11,'test3.csv','Mayla','Mark','Technology',3000);
insert into employee values(12,'test4.csv','Mayla','Roberts','Technology',8500);
insert into employee values(13,'test4.csv','Anthony', 'Anderson','Marketing',9500);
insert into employee values(14,'test5.csv','Anthony','Smith','Technology',6000);
insert into employee values(15,'test5.csv','Jan','Thomas','Sales',5000);
insert into employee values(16,'test5.csv','Jan','Michael','Sales',200);
insert into employee values(17,'test5.csv','Thomas','John','Technology',12900);
insert into employee values(18,'test1.csv','Jan','Michael','Sales',8000);
insert into employee values(19,'test1.csv','Thomas','John','Technology',1540);
insert into employee values(20,'test2.csv','Mayla','Smith','Technology',10500);
insert into employee values(21,'test3.csv','Mayla','Anthony','Technology',5600);
insert into employee values(22,'test4.csv','Anthony', 'Anderson','Marketing',2500);
insert into employee values(23,'test5.csv','Brian','Earl','HR',1200);
insert into employee values(24,'test5.csv','John','Smith','HR_Sales',2000);
insert into employee values(25,'test6.csv','Jan','Thomas','HR_Sales',12000);
insert into employee values(26,'test6.csv','Jan','Michael','Education',1500);
insert into employee values(27,'test7.csv','Thomas','John','HR_Sales',1000);