4

I have a large data frame (100,000 rows) with LON, LAT, VALUE which i want to convert into a matrix. (coordinates in EPSG:3035).

I tried the reshape2 package with the following command

acast(df, lon~lat, value.var="value") 

which worked wonderfully.

When i transformed the coordinates to 'EPSG:4326' and run the same code, its when i get the error.

str(df1)
data.frame':    168643 obs. of  3 variables:
 $ x: num  28 28.1 27.8 28 28.1 ...
 $ y: num  71.1 71 71 71 71 ...
 $ z: num  0.0893 0.093 0.085 0.0886 0.0924 ...
> aa=acast(df1, x~y, value.var="z")
Error in seq_len(n) : argument must be coercible to non-negative integer
In addition: Warning message:
In match(seq_len(n), overall, nomatch = NA) : NAs introduced by coercion

For a reproducible example like the one given below, the code works, but why is it for large data frame like i have, i am getting the error. Has it got to do anything with the transformation of the coordinates.

x=c(-8.084929925, -8.01229693, -7.939629855, -7.866928803, -7.794193877, -7.721425179,    -7.648622813, -7.575786885, -7.502917498, -7.430014757, -7.357078769, -7.284109638, -7.211107472, -7.138072377, -7.065004461, -6.99190383)


y=c(53.07977473, 53.09085897, 53.10189964, 53.11289671, 53.12385014, 53.1347599, 53.14562596, 53.15644829, 53.16722685, 53.17796162, 53.18865255, 53.19929962, 53.2099028, 53.22046205, 53.23097734, 53.24144865)

z=c(0.065, 0.063, 0.062, 0, 0, 0, 0.061, 0.062, 0.064, 0.06, 0.069, 0.074, 0.079, 0.08, 0.092, 0.10)

df=data.frame(x,y,z)
acast(df, x~y, value.var="z")

Any thoughts?

Nav
  • 135
  • 2
  • 7
  • 1
    Could you make your example reproducible? – Paul Hiemstra Mar 04 '13 at 18:52
  • works fine with these data: `df <- data.frame(lon=sample(10:20),lat=sample(10:20),value=10:20)` – Ben Bolker Mar 04 '13 at 19:02
  • @PaulHiemstra i have added a small reproducible example. – Nav Mar 04 '13 at 19:20
  • There is nothing in that acast()-ed-matrix except diagonal elements. You should not be using a matrix object unless the x and y locations have a reasonable degree of repetition. All of your and y values are distinct. You need to rewind to the beginning and explain what you are trying to do with this data, rather than requesting advice that will fail to acheive any reasonable goals. – IRTFM Mar 04 '13 at 19:42
  • @DWin Thank you. My x and Y locations do have a reasonable degree of repetition. I need to convert the data into a matrix for some modelling purposes as requested by a modeller. – Nav Mar 04 '13 at 19:56

2 Answers2

1

Having been assured that this makes sense given the data, which was not really the case for the example offered, do this:

 mtx <- matrix(NA, nrow=length(unique(df$x)), ncol=length(unique(df$y)) )
 mtx[cbind(order(df$x), order(df$y))] <- df$z

You will lose any information on the x and y distances. The ordered unique values could be added to the matrix dimnames if desired.

dimnames(mtx) <- list( sort(unique(df$x)), sort(unique(df$y) ) )

At the moment they do seem to be reasonably regularly spaces so perhaps not all is lost. You could find that measurement errors require some sort of rounding operation before the determination of what are "unique" values:

> diff(df$x)
 [1] 0.07263300 0.07266708 0.07270105 0.07273493 0.07276870 0.07280237 0.07283593 0.07286939
 [9] 0.07290274 0.07293599 0.07296913 0.07300217 0.07303509 0.07306792 0.07310063
> diff(df$y)
 [1] 0.01108424 0.01104067 0.01099707 0.01095343 0.01090976 0.01086606 0.01082233 0.01077856
 [9] 0.01073477 0.01069093 0.01064707 0.01060318 0.01055925 0.01051529 0.01047131

It looks like there is a systematic increase/creep in the interval distances:

> diff(diff(df$x))
 [1] 3.4080e-05 3.3977e-05 3.3874e-05 3.3772e-05 3.3668e-05 3.3562e-05 3.3459e-05 3.3354e-05
 [9] 3.3247e-05 3.3143e-05 3.3035e-05 3.2929e-05 3.2821e-05 3.2715e-05
> diff(diff(df$y))
 [1] -4.357e-05 -4.360e-05 -4.364e-05 -4.367e-05 -4.370e-05 -4.373e-05 -4.377e-05 -4.379e-05
 [9] -4.384e-05 -4.386e-05 -4.389e-05 -4.393e-05 -4.396e-05 -4.398e-05
IRTFM
  • 258,963
  • 21
  • 364
  • 487
0

If the dataset isn't too big, and the LAT and LON variables are integers, you might want to try loading in the matrix with a for loop. Even at 100k elements, it shouldn't take more than a minute to complete.

n = max(df$LON)
m = max(df$LAT)
x = matrix(0, nrow=m, ncol=n)
for (i in nrows(df)){
    x[df[i,"LAT"], df[i,"LON"]] = df[i,"VALUE"]
}
kith
  • 5,486
  • 1
  • 21
  • 21
  • there are a few problems with this: (1) it will be problematic if LAT/LON are not integers; (2) it will be much slower than other possibilities – Ben Bolker Mar 04 '13 at 19:20
  • for this example i am getting "subscript out of bounds" error. – Nav Mar 04 '13 at 19:26
  • If you have negatives or non integer LAT's and LON's you'll need to map them to the index range. Lets say you want to store the results in a 100x100 matrix. You'll need to do write a function that converts the LAT from a range, say 0 to 360, to a range 1 to 100. Then you can use the above code like: x[lat2ind(df[i,"LAT"]), lon2ind(df[i,"LON"])] = df[i,"VALUE"] – kith Mar 04 '13 at 19:46