
I have some data that looks like this:

ID      lat      long     university   date        cat2    cat3   cat4   ...
00001   32.001   -64.001  MIT          2011-07-01  xyz     foo    NA     ...
00002   45.783   67.672   Harvard      2011-07-01  abc     NA     lion   ...
00003   54.823   78.762   Stanford     2011-07-01  xyz     bar    NA     ...
00004   76.782   23.989   IIT Bombay   2011-07-02  NA      foo    NA     ...
00005   32.010   -64.010  NA           2011-07-02  NA      NA     hamster...
00006   32.020   -64.020  NA           2011-07-03  NA      NA     NA     ...
00006   45.793   67.700   NA           2011-08-01  NA      bar    badger ...

I want to impute missing values for the university column based on the lat-long coordinates. This example is obviously made up; the real data is 500K rows and rather sparse in the university column. Imputation packages like Amelia seem to want to fit numerical data to a linear model, and zoo seems to want to fill in missing values based on some sort of ordered series, which I don't have. I want to match nearby lat-longs, not just exact lat-long pairs, so I can't simply fill in one column by matching values from another.

I plan to approach the problem by finding all the lat-long pairs associated with each university, drawing a bounding box around them, and then, for every row that has a lat-long pair but is missing university data, filling in the appropriate university based on which box the point falls in, or perhaps whether it lies within a certain radius of the midpoint of a university's known locations.
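
Here is roughly what I have in mind for the bounding-box part, as a base R sketch. The data frame name datas, the assumption that university is stored as character rather than factor, and the decision to skip points that fall in more than one box are all just placeholders:

## sketch of the bounding-box idea (assumes `datas` with character `university`)
## bounding box of the known locations for each university
known <- datas[!is.na(datas$university), ]
boxes <- do.call(rbind, lapply(split(known, known$university, drop = TRUE), function(d) {
  data.frame(university = d$university[1],
             lat_min  = min(d$lat),  lat_max  = max(d$lat),
             long_min = min(d$long), long_max = max(d$long))
}))

## rows with coordinates but no university
todo <- which(is.na(datas$university) & !is.na(datas$lat) & !is.na(datas$long))

for (i in todo) {
  ## which boxes contain this point?
  hit <- which(datas$lat[i]  >= boxes$lat_min  & datas$lat[i]  <= boxes$lat_max &
               datas$long[i] >= boxes$long_min & datas$long[i] <= boxes$long_max)
  ## only fill in when the point falls in exactly one box
  if (length(hit) == 1) datas$university[i] <- boxes$university[hit]
}

A university with only a single known location gets a zero-area box, so in practice the boxes would probably need some padding.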

Has anyone ever done something similar? Are there any packages that make it easier to group geographically proximate lat-long pairs or maybe even to do geographically-based imputation?

If that works, I'd like to take a crack at imputing some of the other missing values based on existing values in the data (for example, 90% of rows with the xyz, foo, Harvard values also have lion in the 4th category, so we can impute some missing values for cat4), but that's another question, and I imagine a much harder one that I might not even have enough data to answer successfully.
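
If I ever get to that second idea, I imagine a crude first pass would be something like the function below; the (cat2, cat3, university) grouping and the 90% threshold are just placeholders:

## rough sketch: fill missing cat4 with the dominant value seen for the same
## (cat2, cat3, university) combination, but only when that value covers at
## least `threshold` of the known cases for that combination
fill_cat4 <- function(df, threshold = 0.9) {
  key <- interaction(df$cat2, df$cat3, df$university, drop = TRUE)
  for (g in levels(key)) {
    known_rows   <- which(key == g & !is.na(df$cat4))
    missing_rows <- which(key == g &  is.na(df$cat4))
    if (length(known_rows) == 0 || length(missing_rows) == 0) next
    tab <- table(df$cat4[known_rows])
    if (max(tab) / sum(tab) >= threshold) {
      df$cat4[missing_rows] <- names(tab)[which.max(tab)]
    }
  }
  df
}

datas <- fill_cat4(datas)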

William Gunn
  • Would you mind doing a dput(datas) for us? – Rguy Nov 11 '11 at 18:33
  • The simplest route would probably be to just impute using a knn classifier. – joran Nov 11 '11 at 18:35
  • Also, a simple Euclidean distance should do the trick. Take any known latitude/longitude coordinates for a particular university and assign them as THE coordinates for that university. This data set should have exactly Nx2 entries, where N = length(unique(datas$university)). Then take the Euclidean distance (in 2 dimensions) between each unclassified entry and the Nx2 data set. The entry with the minimum distance will be the university you assign to the unclassified lat/lon pair (see the sketch after these comments). – Rguy Nov 11 '11 at 18:47
  • I would add to Rguy's suggestion by suggesting that you start by finding unique pairs or sets of expected categorical values based on specific locations. Since it's only 2D, you could assign a number to each quadrant as a double check to make sure your Euclidean distances are proximate to your actual location (rather than another quadrant's location). – Brandon Bertelsen Nov 11 '11 at 18:54
  • I provided (somewhat) useful links in this answer http://stackoverflow.com/questions/2613420/handling-missing-incomplete-data-in-r-is-there-function-to-mask-but-not-remove/2994546#2994546 – aL3xa Nov 11 '11 at 19:26
  • I can't wait to see the code for this, but with respect to the lat/long for universities, the data should (keyword: should) be available publicly from the Department of Education. You can download the entire Department of Education higher ed universe from nces.ed.gov/ipeds. Look for Institutional Characteristics survey. HTH. – Btibert3 Nov 11 '11 at 20:22
  • Thanks everyone! I would dput(data) but as I said, it's kinda sparse and I don't think the problem would be represented well here. Thanks also for the idea to use simple euclidean distance from known coordinates. – William Gunn Nov 12 '11 at 00:05
  • Thanks also for your links, aL3xa, I read it before posting and you have good points on how to handle these kinds of problems in general. – William Gunn Nov 12 '11 at 00:16
  • I just got a tip that solr can add a geo-index to data and supports location-based queries so that might be a good approach for this. – William Gunn Nov 12 '11 at 19:48
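
To make the Euclidean-distance suggestion from the comments concrete, here is a minimal sketch of the nearest-centroid idea, using the mean of each university's known coordinates as its representative point. The data frame name datas, the plain distance on degrees, and the 0.5-degree cutoff are all made-up assumptions:

## nearest-centroid sketch: one representative lat/long per university,
## then assign each unclassified point to the closest one
## (plain Euclidean distance on degrees is only a rough proxy for real distance)
cents <- aggregate(cbind(lat, long) ~ university,
                   data = datas[!is.na(datas$university), ], FUN = mean)

todo <- which(is.na(datas$university) & !is.na(datas$lat) & !is.na(datas$long))

for (i in todo) {
  d <- sqrt((cents$lat - datas$lat[i])^2 + (cents$long - datas$long[i])^2)
  if (min(d) <= 0.5) {                 # made-up cutoff, in degrees
    datas$university[i] <- cents$university[which.min(d)]
  }
}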

1 Answer


I don't have a package in mind that solves what you're describing. I've done some similar analysis, and I ended up writing something bespoke.

Just to give you a jumping-off point, here's an example of one way of doing a nearest neighbor calculation. Calculating neighbors is kind of slow because, obviously, you have to compare every point against every other point.

## make some pretend data
n <- 1e4
lat <- rnorm(n)
lon <- rnorm(n)
index <- 1:n
myDf <- data.frame(lat, lon, index)

## create a few helper functions

## Euclidean distance between (x1, y1) and each point in (x2, y2)
cartDist <- function(x1, y1, x2, y2){
  ( (x2 - x1)^2 + (y2 - y1)^2 )^.5
}

## indices (into x2/y2) and distances of the n points closest to (x1, y1)
nearestNeighbors <- function(x1, y1, x2, y2, n = 1){
  dists <- cartDist(x1, y1, x2, y2)
  index <- order(dists)[seq_len(n)]
  return(list(index = index, distance = dists[index]))
}


## this could be done in an apply statement
## but it's fugly enough as a loop
myDf$nearestNeighbor <- NA_integer_
system.time({
  for (i in 1:nrow(myDf)){
    others <- myDf[-i, ]                 # every point except the current one
    nn <- nearestNeighbors(myDf$lon[i], myDf$lat[i], others$lon, others$lat)
    myDf$nearestNeighbor[i] <- others$index[nn$index]
  }
})
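
To tie this back to the imputation question, the same helpers could be pointed at only the rows where the university is known. This is just a sketch and assumes the real data frame also carries lat, lon, and university columns:

## hypothetical adaptation: for each row with coordinates but no university,
## copy the university of the nearest row that does have one
known   <- which(!is.na(myDf$university))
unknown <- which( is.na(myDf$university))

for (i in unknown) {
  nn <- nearestNeighbors(myDf$lon[i], myDf$lat[i],
                         myDf$lon[known], myDf$lat[known])
  myDf$university[i] <- myDf$university[known[nn$index]]
}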
JD Long
  • That looks like it would take eons to run on decent sized data but thanks for the nearest neighbor code. That's exactly the kind of thing I was looking for. I think I'm first going to try to get the midpoint of all my known lat-long pairs for unique institutions and try the euclidean distance approach, but will test this too and come back to let you know what I found. – William Gunn Nov 12 '11 at 00:10