
Say I have a large set of data in R that has variables latitude, longitude, magnitude, and depth (for earthquakes) and I want to create a new data set that includes data for all the variables but only between certain values of latitude and longitude. For example, I want earthquakes that are between 0 and 50 longitude and -20 and 45 latitude (but I want the magnitude and depth to still correspond to the correct longitudes and latitudes). Is there an easy way to do this in R? For example:

latitude longitude magnitude depth
45        45         1.0        5
-10       -10        4.5        6
-76       12         2.435      18

and I want to choose data where the latitude is between -80 and 0 and the longitude is between 0 and 50, so the only row that would match would be:

latitude longitude magnitude depth
-76       12         2.435      18

How can I do this?

Didzis Elferts
user2395969

2 Answers

> #Use [ to extract the rows directly
> #See ?Comparison and ?Arithmetic for the operators
> x[x$latitude > 0 & x$latitude < 80 & x$longitude > 0 & x$longitude < 50, ]
  latitude longitude magnitude depth
1       45        45         1     5
> #Or the slightly more readable subset() function
> subset(x, latitude > 0 & latitude < 80 & longitude > 0 & longitude < 50)
  latitude longitude magnitude depth
1       45        45         1     5
> #see ?Extract or ?subset
> #Also read the help manual for a good intro: http://cran.r-project.org/doc/manuals/R-intro.html
Chase
  • +1 for subset() which is the best solution I think - readable, as you say – Peter Ellis May 18 '13 at 08:11
  • @PeterEllis - the readability may come at the cost of some unintended consequences. The help page says something like `this is a convenience function for interactive use. For programming it is better to use the standard subsetting functions like [...` [This post](http://stackoverflow.com/questions/9860090/in-r-why-is-better-than-subset) illustrates why. – Chase May 18 '13 at 14:29

You can index your data.frame, say DF, as follows:

DF[DF$longitude >= 0 & DF$longitude <= 50 & 
   DF$latitude >= -20 & DF$latitude <=  45, ]

 latitude longitude magnitude depth
       45        45         1     5

Here is a breakdown:

The statements within the [brackets] are indexing the data.frame; more specifically, the rows of the data.frame.

In R you can index using a TRUE/FALSE vector (in addition to other options). Therefore we can create a vector that has value TRUE whenever a row is within the geographical bounds and FALSE when it is outside those bounds.

Defining the bounds amounts to checking the four "sides" of your box, i.e., asking if the coordinates are above the lower bound and below the upper bound.

We use the single & operator, as opposed to &&, because we want one TRUE/FALSE value for each row. If this is unclear, look at the difference between the following:

x <- 1:5
x > 1 &  x < 4

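# compare: && expects single TRUE/FALSE values, not vectors
# (older R versions use only the first element; R >= 4.3.0 raises an error)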
x > 1 && x < 4
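
To make the breakdown above concrete, here is a minimal sketch that builds the TRUE/FALSE vector one step at a time before using it to index the rows. The data frame is retyped from the three sample rows in the question, and the names in_lon, in_lat, in_box and result are only illustrative.

```r
# Sample data retyped from the question
DF <- data.frame(
  latitude  = c(45, -10, -76),
  longitude = c(45, -10, 12),
  magnitude = c(1.0, 4.5, 2.435),
  depth     = c(5, 6, 18)
)

# One TRUE/FALSE value per row for each pair of bounds
in_lon <- DF$longitude >= 0 & DF$longitude <= 50
in_lat <- DF$latitude >= -20 & DF$latitude <= 45

# TRUE only where a row satisfies both conditions
in_box <- in_lon & in_lat
print(in_box)

# Keep the matching rows and all columns
result <- DF[in_box, ]
print(result)
```

If the coordinates may contain NA, indexing with DF[which(in_box), ] drops those rows, whereas DF[in_box, ] returns a row of NAs for each of them.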

data.table solution:

If you'd like to use data.table instead of data.frame, it has a somewhat steeper learning curve, but it makes for cleaner syntax and quicker work:

library(data.table)
DT <- data.table(DF)

DT[longitude >= 0 & longitude <= 50 & latitude >= -20 & latitude <=  45]
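
As an optional variation on the line above, data.table also provides between(), which by default includes both bounds, so each pair of comparisons can be written as one call. This is only a sketch: it assumes the data.table package is installed, rebuilds the table from the question's three sample rows, and uses in_box as an illustrative name.

```r
library(data.table)

# Sample data retyped from the question
DT <- data.table(
  latitude  = c(45, -10, -76),
  longitude = c(45, -10, 12),
  magnitude = c(1.0, 4.5, 2.435),
  depth     = c(5, 6, 18)
)

# between() includes both bounds by default (incbounds = TRUE)
in_box <- DT[between(longitude, 0, 50) & between(latitude, -20, 45)]
print(in_box)
```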
Ricardo Saporta