1

I am facing a problem when trying to subset my data, maybe you could help me. What I need is to subset data from first data frame by a column when this column value is equal to the value of a column in the second data frame.

The following are the dataframes I'm using:

> head(places)
  Zona   Poble     lat       lon      alt
1    1  Zorita 40.7353 -0.165748  691.867
2    1 Morella 40.6287 -0.113284  955.719
3    1 Forcall 40.6621 -0.209759  753.882
4    2 Benasal 40.3943 -0.126111  848.171
5    2    Cati 40.4532  0.060409  667.610
6    2  Fredes 40.7079  0.167981 1194.730

> head(data)
      date   time stat_id     lat     lon    tempc
1 20121122 000000       1 40.7353 -0.1657  7.98737
2 20121122 000000       2 40.6287 -0.1133  6.49903
3 20121122 000000       3 40.6621 -0.2098  7.72955
4 20121122 000000       4 40.3943 -0.1261  7.98837
5 20121122 000000       5 40.4532  0.0604 10.35480
6 20121122 000000       6 40.7079  0.1680  6.00769

As you can see, three first places in dataframe "places" belong to Zona == 1 and share lat/lon with three first rows in dataframe "data". I would like to select rows in data that share lat/lon with Zona == i on places.dat.

The R script I am trying is

datos=read.table("data.dat",header=T)
places=read.table("places.dat",header=T)

data=as.data.frame(datos)
place=as.data.frame(pobles)

data$time[data$time == 0] = "000000"

subset(data,data$lat == place$lat[place$Zona == 1])

So, subset would show three rows for each time in data.dat but it is only selecting two of three, as it follows

> subset(data,data$lat == place$lat[place$Zona == 1])
         date   time stat_id     lat     lon    tempc
1    20121122 000000       1 40.7353 -0.1657  7.98737
2    20121122 000000       2 40.6287 -0.1133  6.49903
385  20121122  30000       1 40.7353 -0.1657  7.00632
386  20121122  30000       2 40.6287 -0.1133  4.83684
769  20121122  60000       1 40.7353 -0.1657  6.55283
770  20121122  60000       2 40.6287 -0.1133  4.85467
1153 20121122  90000       1 40.7353 -0.1657  6.35216
1154 20121122  90000       2 40.6287 -0.1133  5.66342
1537 20121122 120000       1 40.7353 -0.1657 11.47750
1538 20121122 120000       2 40.6287 -0.1133 10.30310
1921 20121122 150000       1 40.7353 -0.1657 13.87090
1922 20121122 150000       2 40.6287 -0.1133 11.90640
2305 20121122 180000       1 40.7353 -0.1657 10.30840
2306 20121122 180000       2 40.6287 -0.1133  7.61322
2689 20121122 210000       1 40.7353 -0.1657  6.29745
2690 20121122 210000       2 40.6287 -0.1133  6.63173
3073 20121123 000000       1 40.7353 -0.1657  4.78633
3074 20121123 000000       2 40.6287 -0.1133  5.31070
3457 20121123  30000       1 40.7353 -0.1657  6.84001
3458 20121123  30000       2 40.6287 -0.1133  6.88369
3841 20121123  60000       1 40.7353 -0.1657  5.71790

For sure I'm missing something, could you help me? Any idea or hint will be appreciated.

Thanks

Data files are available here:

EDIT Following answer from @A.R I tried this code to select data but not sure if it is just the exact way.

for(i in 1:128) {
  for(j in 1:2) {
    a=sqrt((place$lat[i]-datos$lat[j])^2+(place$lon[i]-datos$lon[j])^2)
    n=which.min(a)
    while(n <= 9344) {
      b=cbind(i,n,datos$tempc[n],place$Zona[i])
      n=n+128
    }
  }
}

and get:

> b
       i    n           
[1,] 128 9217 10.1198 30

it gives just the value for the last i value, I would like to save all. Sure it is a basic but I can't figure out, please be patient as I'm not a experienced R user. Thanks again

pacomet
  • 5,011
  • 12
  • 59
  • 111
  • Not sure what you're trying to do, but perhaps you should merge() first and then subset? Also are you trying to match on a floating point number (Lat / Long)? Refer http://www.hep.by/gnu/r-patched/r-faq/R-FAQ_82.html – Sean Nov 26 '12 at 14:43
  • Hi @Sean and thanks. I think merging is not a good idea as places has 128 rows (locations) and data.dat has 128 locations x 72 hours. Maybe the issue is the floating point – pacomet Nov 27 '12 at 15:24

1 Answers1

2

first you need to round the decimals of places lon to 4 digits. Probably this is the reason why you are having problems:

places=read.table("places.dat",header=T)
places=round(places$lon,digits=4)


datos[which((datos$lat==places$lat & datos$lon==places$lon) & places$Zona==1),]

The result for this condition is a total of 146 points.

Edit 1 (following a comment by Sean)

I assumed in my anwswer that in places, the lat was rounded and long not.

But as was pointed out by Sean,comparing floats is not a good idea. It's better to calculate the distance between each places point and datos point, and select the one with the smallest distance, bellow a minimum distance (e.g. half of the distance between the points in datos), as the matching one.

Edit 2

Try something like this:

b=matrix(nrow=dim(places)[1],ncol=5)
a=c()
data.p=c()
n=c()
for(i in 1:dim(places)[1]) {
  for(j in 1:dim(data)[1]) {
    a[j]=sqrt((places$lat[i]-data$lat[j])^2+(places$lon[i]-data$lon[j])^2)
  }   
  data.p[i]=which.min(a)
  n[i]=min(a)
}
b=cbind(places=1:(dim(places)[1]),data=data.p,distance=n,tempc=data$tempc[data.p],Zona=places$Zona)

than do some queries:

b[which(b[,3]<1),]
b[which(b[,3]<0.00001),]
Gago-Silva
  • 1,873
  • 4
  • 22
  • 46
  • I thought that the decimals could be the problem. Then I will have to round lat and lon decimals to 4 digits so they have same dimensions before subsetting. I'll try tomorrow at work and see. Thanks – pacomet Nov 27 '12 at 15:27
  • 1
    sorry, but comparing two floating point numbers for equality is a very bad idea even after using round-refer R-FAQ "7.31 Why doesn't R think these numbers are equal?" - if you insist on comparing these, why not convert to character-strings and you might as well combine lat & long into one string... but beware... – Sean Nov 27 '12 at 17:38
  • Sean it depends a lot on the data itself and how the coordinates were obtained (including their precision), and what the user will do with this comparison. I would expect that inside the same dataset(places) to have lat and long with the same amount of decimal numbers, this is not happening and one can assume that lat was round to 4 digits and long not, etc – Gago-Silva Nov 27 '12 at 17:51
  • Maybe I should use dist() to find distances between a point in places and all points in datos. Then I could assign a new column to datos with places$Zona that is what I need to find max/min values of datos$tempc for each places$Zona (Zona means area in Spanish). It's my first time trying to calculate a distance in R, is dist() the right command? – pacomet Nov 28 '12 at 10:17
  • what you can do is for each i point in places and all j points in datos calculate the distance using: sqrt((places$lat[i]-datos$lat[j])^2+(places$lon[i]-datos$lon[j])) . This means that for each i you will have a list of distances with the same length of datos. After you need to create a minimum distance threshold to avoid selecting points that are too far away when they have no close neighbors. I am not sure if dist() will be able to do what you need here. – Gago-Silva Nov 28 '12 at 12:06
  • Thanks @A.R, your hint to calculate distance was fine. – pacomet Dec 05 '12 at 18:07
  • Hmm, I'm not able to find the point. Please take a look to the code I'm adding in the question. I'm not an experienced user and maybe I'm loosing something – pacomet Dec 05 '12 at 18:44