
I have a data frame (df) that includes latitude and longitude coordinates (Lat, Long) as well as the depth (Depth) of a temperature measurement for each entry. Essentially, each temperature measurement carries (x, y, z) = (Lat, Long, Depth) location information.

I'm trying to clean the data by finding and removing duplicate measurement locations. The easy part is removing the exact duplicates, handled as such:

df = df[!duplicated(df[, c("Lat", "Long", "Depth")]), ]

However, the lat/long values of some entries are just slightly off, meaning the above code won't catch them even though they are clearly duplicates (e.g. lat = 39.252880 vs. lat = 39.252887).

Is there a way to find duplicates that are within a certain absolute value or percentage of the first instance?

I appreciate any help, thanks!

Kevin M
    You could round all the values to whatever tolerance you prefer and then look for duplicates. In your example, rounding to three decimal places would give latitude = 39.253 for both points. Or maybe it would make more sense to calculate the spatial distance between each of the measurements and find groups of points clustered within some critical distance of each other. Then, within those clusters, find rows with the same depth and average them. See [this SO answer](http://stackoverflow.com/a/21006437/496488) for how to find the spatial distances and identify clusters of points. – eipi10 Aug 02 '16 at 22:22
  • For the first method, you can shorten your code: `df.new = df[!duplicated(round(df[,c("lat","long","depth")], 3)), ]`. – eipi10 Aug 02 '16 at 22:32
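
Following up on eipi10's second suggestion, here is a minimal sketch of the distance-based grouping using base R's dist(), hclust(), and cutree(), treating lat/long as planar coordinates (adequate at this tiny tolerance); the temperature column Temp and the cut height h = 0.0001 are assumptions for illustration:

d = dist(df[, c("Lat", "Long")])     # pairwise distances between points
cl = hclust(d, method = "single")    # single linkage chains nearby points together
df$cluster = cutree(cl, h = 0.0001)  # assumed tolerance, in degrees

# One row per (cluster, Depth), averaging the coordinates and the
# (hypothetical) Temp measurement column
df.new = aggregate(cbind(Lat, Long, Temp) ~ cluster + Depth, data = df, FUN = mean)

Note that dist() builds the full pairwise distance matrix, so this approach scales poorly to very large datasets.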

1 Answer


Based on this post I was able to come up with a solution. I tightened the function's tolerance for "duplicates" to 0.001; otherwise the function is unchanged. The application, however, changes slightly to:

output = data.frame(apply(dataDF, 2, fun))

because I wanted to compare values within a single column rather than within a single row.
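
For reference, the function from the linked post isn't reproduced here; a plausible reconstruction of a near-duplicate flagger with the 0.001 tolerance (my sketch, not the original code) would be:

# Sketch: flag every value that lies within `tol` of at least one other value
# in the same vector, so all members of a near-duplicate group come back TRUE
fun = function(x, tol = 0.001) {
  sapply(x, function(v) sum(abs(x - v) < tol) > 1)
}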

Next, I add artificial indices to my output data frame for later use:

output$ind = 1:nrow(output)

The main part is finding the row indices where the function returned TRUE for all three location fields (Lat, Long, Depth). The following code finds the indices where all three are TRUE, creates a temporary data frame with only those entries (still logicals), finds the indices of the duplicates within it, and then maps those back to the row indices to remove from the full, original dataset (pardon the bad variable names):

ind = which(with(output, Lat & Long & Depth))  # rows flagged in all three columns
temp = output[ind, ]                           # candidate rows (still logicals)
temp$ind = ind
ind2 = duplicated(temp[, 1:3])                 # TRUE for every candidate after the first
rem = ind[ind2]                                # original row indices to remove

# Guard: if rem is empty, dataDF[-integer(0), ] would return zero rows
df.final = if (length(rem) > 0) dataDF[-rem, ] else dataDF
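
As a quick sanity check, here is a toy run with invented values (using the sketch of fun above), where rows 1 and 3 are near-duplicates and row 3 is the one removed:

dataDF = data.frame(Lat = c(39.252880, 40.100000, 39.252887),
                    Long = c(-120.5, -121.0, -120.5),
                    Depth = c(10, 25, 10))
output = data.frame(apply(dataDF, 2, fun))
# Rows 1 and 3 come back TRUE for Lat, Long, and Depth, so ind = c(1, 3),
# ind2 = c(FALSE, TRUE), rem = 3, and df.final keeps rows 1 and 2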

Hopefully that helps! It's a bit convoluted, and the function is very slow for large datasets, but it gets the job done.
