
My dataset looks something like this (note: the dataset below is hypothetical).

Objective: a sales employee has to go to a particular location and verify the houses/stores/buildings, and the device captures the information below.

| Sr.No. | Store_Name | Phone-No. | Agent_id | Area | Lat-Long |
|--------|------------|-----------|----------|------|----------|
| 1 | ABC Stores | 89099090 | 121 | Bay Area | 23.909090, 89.878798 |
| 2 | Wuhan Masks | 45453434 | 122 | Santa Fe | 24.452134, 78.123243 |
| 3 | Twitter Cafe | 67556090 | 123 | Middle East | 11.889766, 23.334483 |
| 4 | abc | 33445569 | 121 | Santa Cruz | 23.345678, 89.234213 |
| 5 | Silver Gym | 11004110 | 234 | Worli Sea Link | 56.564311, 78.909087 |
| 6 | CK Clothings | 00908876 | 223 | 90th Street | 34.445887, 12.887654 |

Fact #1: there is no unique identifier for finding duplicates. For example, Sr.No. 1 and 4 are basically the same store.

In this dummy dataset, all the columns can be manipulated for the same store/house/building/outlet:

a) Since the name is entered manually, the same house/store can be entered into the system under different names, so multiple visits can be recorded.

b) The mobile number can also be manipulated; a different number can be associated with the same outlet.

c) The device capturing lat-long can also be fudged by the agent, e.g. by moving closer to or near the building.

Problem:

  1. How can the lat-long data be made a unique identifier for finding duplicates in a huge dataset, keeping point c) above in mind?

  2. Deploying QR codes is also not very helpful, as these can be tweaked too.

  3. The goal is to stop this fraudulent practice by employees (the same employee can visit the same store/outlet again, or a different employee can visit the same outlet, to inflate the visit count).

Right now I can only think of the Lat-Long column for making a UID; please feel free to suggest anything else that could work.
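Since two GPS readings of the same building will almost never be bit-identical floating-point values, one option is to flag records that lie within a small radius of each other as candidate duplicates, instead of comparing lat-long values for equality. A minimal sketch in Python (the 100 m radius and the near-duplicate coordinates for Sr.No. 4 are assumptions for illustration; the real Sr.No. 4 coordinates in the table were fudged far away):

```python
# Sketch: flag likely duplicates by distance, not by exact lat-long equality.
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/long points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

records = [  # (Sr.No., Store_Name, lat, long)
    (1, "ABC Stores", 23.909090, 89.878798),
    (4, "abc", 23.909095, 89.878801),  # hypothetical near-duplicate of Sr.No. 1
]

RADIUS_M = 100  # assumed tolerance for "same building"
pairs = [
    (a[0], b[0])
    for i, a in enumerate(records)
    for b in records[i + 1:]
    if haversine_m(a[2], a[3], b[2], b[3]) <= RADIUS_M
]
print(pairs)  # pairs of Sr.No. flagged as likely duplicates, e.g. [(1, 4)]
```

The pairwise comparison is O(n²); for a huge dataset you would first bucket records (e.g. by rounded coordinates or a geohash prefix) and only compare within neighbouring buckets.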

  • I don't think it is wise to try to use lat/lon as a unique identifier, including reasons you mentioned above (e.g., agent moving closer to the building); another reason: they are floating-point, and due to many reasons (https://stackoverflow.com/q/9508518, https://stackoverflow.com/q/588004, and https://en.wikipedia.org/wiki/IEEE_754) tests of strict equality should not be relied on. Having said that, if you have a known "truth" set of coordinates, then you can calculate distance from each of your "observed" data to the "truth" data and use the ID of the closest match for its uniqueness. – r2evans Dec 24 '20 at 14:19
  • Do you have the real latitudes and longitudes of the stores? I take it these readings are from the employees. I would work towards getting longitude and latitude data for each store, and then use a distance measure to assign each of these rows to that store. You could then use a string distance function for further verification on store names, like the `adist` function in R, or use something more complex if it's an area you're comfortable in. Edit: sorry, didn't see the comment above, which says the same thing, which is reassuring :) – Jonny Phelps Dec 24 '20 at 14:45
  • Thanks @r2evans can you please tell me will this work in a large dataset because shortest distance when assigned to a specific id can have multiple instances right ? as far as I can understand we will have 2 new cols Original Coordinates and Distance observed. – rajeswa Dec 28 '20 at 13:04
  • @JonnyPhelps , Thanks , it will really great if you can explain a bit " then use a distance measure to assign each of these rows to that store " from the comment above . – rajeswa Dec 28 '20 at 13:06
  • True *shortest* distance will only have more than one row if one point is perfectly equidistant from two stations. From Jonny's comment, I suspect the intent is to find the closest store (perhaps within a threshold) for each of your observations above. To know this, you either calculate the distance between *all* observations and *all* stations, or you introduce a simple heuristic to reduce the number of candidate stations for your observations. – r2evans Dec 28 '20 at 13:08
  • @r2evans, Much Thanks – rajeswa Dec 30 '20 at 14:14
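
The approach suggested in the comments, assigning each observed visit to the nearest store in a trusted "truth" list and using that store's ID for deduplication, could be sketched like this (the store IDs, coordinates, and the 200 m cut-off are all hypothetical):

```python
# Sketch: match each agent reading to the closest known store, if any.
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/long points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin(math.radians(lat2 - lat1) / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Trusted ("truth") store coordinates -- hypothetical values.
stores = {
    "S1": (23.909090, 89.878798),
    "S2": (24.452134, 78.123243),
}

def nearest_store(lat, lon, max_m=200):
    """Return the ID of the closest known store, or None if no store
    lies within max_m metres (an assumed cut-off)."""
    sid, d = min(
        ((k, haversine_m(lat, lon, slat, slon)) for k, (slat, slon) in stores.items()),
        key=lambda t: t[1],
    )
    return sid if d <= max_m else None

# An agent reading taken a few metres from store S1:
print(nearest_store(23.909110, 89.878820))  # "S1"
print(nearest_store(11.889766, 23.334483))  # None -- no store nearby
```

Once every visit row carries a matched store ID (or None for unmatched readings), duplicates reduce to grouping by (store ID, agent, date), and fuzzy name matching can be layered on top as a cross-check.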

0 Answers