0

I tried to google this but I didn't know how to write it without needing to explain myself.

Basically I have a data set that I am performing some validation on, and I am looking for duplicate data imports from users of the database I am working on.

The reason for this is that the users have been importing building information into the database. In some cases, the name (for example) has been incorrect, and so they have amended the import file and re uploaded the data.

The issue is that rather than replacing the existing building, the importer (understandably) creates new building records. So I need a way to return building records that have a duplicate, so that we can remove the older records.

Obviously this is not perfect, but I'm trying to use common fields (building size, location) to locate duplicates, as these fields would be the same; it's only the names that would have changed. So I if I were to have tables laid out as below, how would I isolate the duplicate rows? (sorry for the imgur link, I don't know how to format text here into a table-like format; the table is very much simplified)

https://i.stack.imgur.com/p9D98.jpg

So you can see that the buildings at LocationID 1 and 4 have only been imported once, but the import for LocationId 2 has been run twice and 3 times for LocationID 3 - these are the records I need to find.

I simply want to know if there is something I can use to isolate rows where a series of columns are duplicated - can anyone help?

Conrad Frix
  • 51,984
  • 12
  • 96
  • 155
Joe Farmer
  • 21
  • 2
  • 3
    http://stackoverflow.com/questions/how-to-ask - not because you did it bad, but you should still read this if this is your first time. – griffin Oct 01 '13 at 14:38
  • "LocationID 1 and 4 have only been imported once, but the import for LocationId 2 has been run twice and 3 times for LocationID 3" as I interpret your picture I would rather say `LocationID was imported three times, LocationID2 two times, LocationID3 six times and LocationID4 four times` how am I reading this wrong? – DrCopyPaste Oct 01 '13 at 14:51
  • Thanks for your responses - the end game is indeed to get rid of the older ones, but it is not part of this script. I need to return the rows with this script to identify which do need removing and which are genuine. Obviously there will be buildings, for example, that will have the same area (because they are standard sized cabins, e.g.) so I will need to manually exclude these. But I need in the first instance to identify these rows. – Joe Farmer Oct 01 '13 at 14:54
  • @DrCopyPaste They seem to mean the locationID/BuildingArea combinations. Joe, is what you want to select all rows that have duplicates, along with said suplicates? (So only select a row if there is another with the same essential information)? – Amber Oct 01 '13 at 14:56
  • @DrCopyPaste - Sorry, I may not have explained myself fully; there are multiple buildings at each location. So LocationID 1 has 3 buildings (A, B and C), whereas Location 2 has only 1 building (A) which has been imported twice. Does this make sense? – Joe Farmer Oct 01 '13 at 14:58
  • yea, now I get it, you actually do explain it in your post but I probably misread it (which is easy to do though imho :D) – DrCopyPaste Oct 01 '13 at 15:04
  • @DrCopyPaste no worries :) – Joe Farmer Oct 01 '13 at 15:08

2 Answers2

1

This gives you the duplicates:

select *
from TheTable
where (BuildingNo, BuildingArea, LocationId) in
(
  select BuildingNo, BuildingArea, LocationId
  from TheTable
  group by BuildingNo, BuildingArea, LocationId
  having count(*) > 1
);

What database system do you use? This works in Oracle. It should work in other dbms, too.

Thorsten Kettner
  • 89,309
  • 7
  • 49
  • 73
0

Assuming you have a primary key that auto-increments in SQL Server (i.e. set an identify seed), you can write the following query to identify the earliest record in the database for which there is more than one record with identical building, size and location data:

SELECT building, size, location, MIN(ID)
FROM LoadedTableData
GROUP BY building size, location
HAVING COUNT(*) > 1

Correspondingly, you could delete these rows as follows:

DELETE FROM LoadedTableData
WHERE ID IN (
  SELECT MIN(ID) as MIN_ID
  FROM LoadedTableData
  GROUP BY building size, location
  HAVING COUNT(*) > 1
)
Andriy M
  • 76,112
  • 17
  • 94
  • 154
  • If there are three or more duplicates, your DELETE statement will not eliminate all of them. However, you could remove the HAVING and revert the IN to NOT IN. That way only one entry per `(building size, location)` would remain. – Andriy M Oct 01 '13 at 16:26
  • Good Point Andriy, Thanks :) –  Oct 02 '13 at 13:18