
My dataset looks something like this (note: the dataset below is hypothetical).

Objective: a sales employee has to go to a particular location and verify the houses/stores/buildings, and the device captures the information below.

| Sr.No. | Store_Name | Phone-No. | Agent_id | Area | Lat-Long |
|--------|------------|-----------|----------|------|----------|
| 1 | ABC Stores | 89099090 | 121 | Bay Area | 23.909090, 89.878798 |
| 2 | Wuhan Masks | 45453434 | 122 | Santa Fe | 24.452134, 78.123243 |
| 3 | Twitter Cafe | 67556090 | 123 | Middle East | 11.889766, 23.334483 |
| 4 | abc | 33445569 | 121 | Santa Cruz | 23.345678, 89.234213 |
| 5 | Silver Gym | 11004110 | 234 | Worli Sea Link | 56.564311, 78.909087 |
| 6 | CK Clothings | 00908876 | 223 | 90th Street | 34.445887, 12.887654 |

Fact #1: there is no unique identifier for finding duplicates. For example, Sr.No. 1 and 4 are basically the same store.

In this dummy dataset, all the columns can be manipulated for the same store/house/building/outlet:

a) Since the name is entered manually, the same house/store can be entered into the system under different names, so multiple visits can be recorded.

b) The mobile number can also be manipulated; a different number can be associated with the same outlet.

c) The device capturing lat-long can also be fudged by the agent, e.g. by moving closer to or near the building.

Problem:

  1. How can the lat-long data be made a unique identifier for finding duplicates in a huge dataset, keeping point c) above in mind?

  2. Deploying QR codes is also not very helpful, as these can be tweaked too.

  3. The goal is to stop this fraudulent practice by employees (the same employee can visit the same store/outlet again, or a different employee can visit the same outlet, to inflate the visit count).

Right now I can only think of the Lat-Long column for making a UID; please feel free to suggest anything else that could work.
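Since two GPS readings of the same building will almost never be bit-identical floating-point values, one option is to flag records that lie within a small radius of each other as candidate duplicates, instead of comparing lat-long values for equality. A minimal sketch in Python (the 100 m radius and the near-duplicate coordinates for Sr.No. 4 are assumptions for illustration; the real Sr.No. 4 coordinates in the table were fudged far away):

```python
# Sketch: flag likely duplicates by distance, not by exact lat-long equality.
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/long points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

records = [  # (Sr.No., Store_Name, lat, long)
    (1, "ABC Stores", 23.909090, 89.878798),
    (4, "abc", 23.909095, 89.878801),  # hypothetical near-duplicate of Sr.No. 1
]

RADIUS_M = 100  # assumed tolerance for "same building"
pairs = [
    (a[0], b[0])
    for i, a in enumerate(records)
    for b in records[i + 1:]
    if haversine_m(a[2], a[3], b[2], b[3]) <= RADIUS_M
]
print(pairs)  # pairs of Sr.No. flagged as likely duplicates, e.g. [(1, 4)]
```

The pairwise comparison is O(n²); for a huge dataset you would first bucket records (e.g. by rounded coordinates or a geohash prefix) and only compare within neighbouring buckets.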

  • I don't think it is wise to try to use lat/lon as a unique identifier, including reasons you mentioned above (e.g., agent moving closer to the building); another reason: they are floating-point, and due to many reasons (https://stackoverflow.com/q/9508518, https://stackoverflow.com/q/588004, and https://en.wikipedia.org/wiki/IEEE_754) tests of strict equality should not be relied on. Having said that, if you have a known "truth" set of coordinates, then you can calculate distance from each of your "observed" data to the "truth" data and use the ID of the closest match for its uniqueness. – r2evans Dec 24 '20 at 14:19
  • Do you have the real latitudes and longitudes of the stores? I take it these readings are from the employees. I would work towards getting longitude and latitude data for each store, and then use a distance measure to assign each of these rows to that store. You could then use a string distance function for further verification on store names, like the `adist` function in R, or use something more complex if it's an area you're comfortable in. Edit: sorry, didn't see the comment above, which says the same thing, which is reassuring :) – Jonny Phelps Dec 24 '20 at 14:45
  • Thanks @r2evans can you please tell me will this work in a large dataset because shortest distance when assigned to a specific id can have multiple instances right ? as far as I can understand we will have 2 new cols Original Coordinates and Distance observed. – rajeswa Dec 28 '20 at 13:04
  • @JonnyPhelps , Thanks , it will really great if you can explain a bit " then use a distance measure to assign each of these rows to that store " from the comment above . – rajeswa Dec 28 '20 at 13:06
  • True *shortest* distance will only have more than one row if one point is perfectly equidistant from two stations. From Jonny's comment, I suspect the intent is to find the closest store (perhaps within a threshold) for each of your observations above. To know this, you either calculate the distance between *all* observations and *all* stations, or you introduce a simple heuristic to reduce the number of candidate stations for your observations. – r2evans Dec 28 '20 at 13:08
  • @r2evans, Much Thanks – rajeswa Dec 30 '20 at 14:14
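
The approach suggested in the comments, assigning each observed visit to the nearest store in a trusted "truth" list and using that store's ID for deduplication, could be sketched like this (the store IDs, coordinates, and the 200 m cut-off are all hypothetical):

```python
# Sketch: match each agent reading to the closest known store, if any.
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/long points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Trusted ("truth") store coordinates -- hypothetical values.
stores = {
    "S1": (23.909090, 89.878798),
    "S2": (24.452134, 78.123243),
}

def nearest_store(lat, lon, max_m=200):
    """Return the ID of the closest known store, or None if no store
    lies within max_m metres (an assumed cut-off)."""
    sid, d = min(
        ((k, haversine_m(lat, lon, slat, slon)) for k, (slat, slon) in stores.items()),
        key=lambda t: t[1],
    )
    return sid if d <= max_m else None

# An agent reading taken a few metres from store S1:
print(nearest_store(23.909110, 89.878820))  # "S1"
print(nearest_store(11.889766, 23.334483))  # None -- no store nearby
```

Once every visit row carries a matched store ID (or None for unmatched readings), duplicates reduce to grouping by (store ID, agent, date), and fuzzy name matching can be layered on top as a cross-check.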

0 Answers