0

I have two unequally sized data frames called OBS and REF. Both contain 9864 columns - columns 1 and 2 are the longitude and latitude, and columns 3-9864 are daily temperature values for 27 years. They have vastly different numbers of rows - OBS has 12375 and REF has just 504.

OBS: 12375 obs. of 9864 variables

Lon     Lat     1979.01.01     1979.01.02     1979.01.03
0.000   39.75   13.69          13.14          13.32
1.000   39.75   12.93          12.41          12.59
2.000   39.75   11.78          10.62          11.15
3.000   39.75   11.73          10.94          12.16

REF: 504 obs. of 9864 variables

Lon      Lat       1979.01.01     1979.01.02     1979.01.03
0.0000   37.6559   12.69          12.35          12.60
2.8125   37.6559   13.43          12.97          13.23
5.6250   37.6559   13.91          13.64          13.71
8.4375   37.6559   14.12          14.24          14.01

What I want to do is output a new data frame (REF_Closest) that takes the coordinates from the OBS data frame, finds the closest coordinates from the REF data frame, and outputs these coordinates and all the variables to the right of this, e.g.:

REF_Closest: 12375 obs. of 9864 variables

Lon      Lat       1979.01.01     1979.01.02     1979.01.03
0.0000   37.6559   12.69          12.35          12.60
0.0000   37.6559   12.69          12.35          12.60
2.8125   37.6559   13.43          12.97          13.23
2.8125   37.6559   13.43          12.97          13.23

I have tried adapting other similar questions but cannot work this out. Any suggestions?

DJ-AFC
  • 569
  • 6
  • 16
  • I don't understand from your question why REF_Closest has 9864 rows, or why the first two rows are repeated? – Stephen Henderson Jul 11 '19 at 13:25
  • REF_Closest has 9864 columns rather than rows. The first two rows are repeated because the closest coordinates to rows 1 and 2 of the OBS data frame are from row 1 from the REF data frame. The closest coordinates to rows 3 and 4 of the OBS data frame are from row 3 from the REF data frame etc. – DJ-AFC Jul 11 '19 at 13:28
  • OK if I understand you correctly this probably has a solution: https://stackoverflow.com/q/27321856/1527403 – Stephen Henderson Jul 11 '19 at 13:41

2 Answers2

1

I came up with a data.table solution:

library(data.table)

# we will use this dummy variable to group by
ref[, id := 1:.N]

ref[, cbind(LonRef = Lon, 
      LatRef = Lat, 
      obs[which.min(as.matrix(dist(rbind(ref[.GRP, 1:2], 
                                         obs[, 1:2])))[2:(.N+1), 1]), ]), 
      by = id]

What is in there:

  • cbind binds the columns of three different sources:
    • Ref's lat
    • Ref's lon
    • and a data.frame (data.table too) with the observed readings.
  • That data.frame is calculated as follows:
    • Calculate the ecuclidean distance with dist from ref's current row (hence the .GRP) to all lat-lon pairs in obs (hence the rbind).
    • identify the minimum of the first column, excluding itself (henche the (2:.N))
    • Return the full row of obs for such observed minimum distance (that's the obs[...)

After which I obtained:

   id LonRef  LatRef Lon   Lat 1979.01.01 1979.01.02 1979.01.03
1:  1 0.0000 37.6559   0 39.75      13.69      13.14      13.32
2:  2 2.8125 37.6559   3 39.75      11.73      10.94      12.16
3:  3 5.6250 37.6559   3 39.75      11.73      10.94      12.16
4:  4 8.4375 37.6559   3 39.75      11.73      10.94      12.16

I used data:

obs <- fread("Lon     Lat     1979.01.01     1979.01.02     1979.01.03
 0.000   39.75   13.69          13.14          13.32
 1.000   39.75   12.93          12.41          12.59
 2.000   39.75   11.78          10.62          11.15
 3.000   39.75   11.73          10.94          12.16")

ref <- fread("Lon      Lat       1979.01.01     1979.01.02     1979.01.03
 0.0000   37.6559   12.69          12.35          12.60
 2.8125   37.6559   13.43          12.97          13.23
 5.6250   37.6559   13.91          13.64          13.71
+ 8.4375   37.6559   14.12          14.24          14.01")
PavoDive
  • 6,322
  • 2
  • 29
  • 55
  • Many thanks for this elegant solution. It gives me the following error: Error in `:=`(id, 1:.N) : Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways. See help(":="). – DJ-AFC Jul 11 '19 at 17:21
  • That's because your OBS and REF variables aren't data.table. Do `setDT(OBS) ` and `setDT(REF)` and try again – PavoDive Jul 11 '19 at 17:29
  • Yes that was it thanks. I think this solution will work but the script has been running for several hours now and is yet to finish. – DJ-AFC Jul 11 '19 at 22:26
  • Check package `RANN`. Maybe the `nn2` function can be of help. The overall strategy I would use is: 1. assign IDs to both `OBS` and `REF`. 2. find the closest coordinate, using `RANN::nn2`, and get the corresponding IDs in a `result` data.table. 3. Left join `result` to `REF`' and assign to `result`. Hope it helps – PavoDive Jul 12 '19 at 02:55
  • Everything is running so slowly because my data is structured in data frames. Would switching to matrix format be much more efficient? – DJ-AFC Jul 15 '19 at 09:27
  • I'm not sure, try a sample of 100 rows or so, and see for yourself. I think what makes your data run slow is the huge amount of columns. To me, it seems you could `melt` your data to be in long form. Also, try adding IDs to both `OBS` `REF`, so you can save the matching pairs for future use (it will only take a couple joins then) – PavoDive Jul 15 '19 at 09:47
0

With a slight alteration - adding an extra row = this will give you the row numbers in OBS that have closest (euclidean) distance to each row of REF (i.e. its redundant).

> REF
     Lon     Lat X1979.01.01 X1979.01.02 X1979.01.03
1 0.0000 37.6559       12.69       12.35       12.60
2 2.8125 37.6559       13.43       12.97       13.23
3 5.6250 37.6559       13.91       13.64       13.71
4 8.4375 37.6559       14.12       14.24       14.01
> OBS
  Lon   Lat X1979.01.01 X1979.01.02 X1979.01.03
1   0 39.75       13.69       13.14       13.32
2   1 39.75       12.93       12.41       12.59
3   2 39.75       11.78       10.62       11.15
4   3 39.75       11.73       10.94       12.16
5   8 38.50       12.34       14.23       17.23


nearest.rows <- apply (OBS,1, function(OBSrow) 
                       which.min(sqrt((OBSrow[1] - REF$Lon)^2 + (OBSrow[2] - REF$Lat)^2))
                      )

> nearest.rows
[1] 1 1 2 2 4
Stephen Henderson
  • 6,340
  • 3
  • 27
  • 33