I am loading the Gowalla dataset, available from the Stanford SNAP repository, into R and renaming the columns. https://snap.stanford.edu/data/loc-gowalla.html

Gowalla<-read.csv(file = "Gowalla_edges.txt", sep="\t", header=FALSE)
colnames(Gowalla)<-c("uid", "utc", "lat", "long", "vid")

My aim is to select the rows whose latitudes and longitudes fall within the city of London. The bounding box in terms of latitudes and longitudes is given at http://www.mapdevelopers.com/geocode_bounding_box.php

You can search there for the bounding box for London, and it gives you the range of latitudes and longitudes.

Now when I search in R for a specific latitude, for example

which(Gowalla$lat == 30.23591) 

It returns an empty result, even though this is the very first latitude in the data!

However, if I search for vid, which is an integer and not a decimal like latitude

which(Gowalla$vid==22847)

it gives me the row numbers for that value.

So my question is: why can't I search for latitudes and longitudes using the "which" function, and why does it return an empty result in my case?

Once I find the answer to this, I can use if-else to search for rows which fall in London's bounding box. Is there an efficient method of searching for rows which fall in London's bounding box?

The bounding box for London is between latitudes 51.672343 and 51.384940 and longitudes 0.148271 and -0.351468.

Thanks.

Asad Feroz Ali
  • `==` shouldn't be used to search for floating point values. You should use `which(abs(Gowalla$lat - 30.23591) <= 0.00000001)` where `0.00000001` is your desired tolerance – digEmAll Mar 05 '16 at 10:51
  • It is never a good idea to compare float numbers with `==`. Use `all.equal()` instead. For more information see this [all-time classic SO question](http://stackoverflow.com/questions/9508518/why-are-these-numbers-not-equal). – RHertel Mar 05 '16 at 10:52
  • When `R` prints `numeric` values, it rounds them. So `30.23591` is not the actual value, but a rounded version. If you try `which(Gowalla$lat == Gowalla$lat[1])` you'll receive a non-empty result. To select data inside a box, try `lat < 51.672343 & lat > 51.384940 & long < 0.148271 & long > -0.351468` (I omitted the `Gowalla$` part). – nicola Mar 05 '16 at 10:54
  • Wouldn't it be easier to convert the bounding box to a polygon and use the `rgeos` package to find points within the polygon? Or perhaps `sp::point.in.polygon`? – Roman Luštrik Mar 05 '16 at 10:57
  • So to avoid complexity, can I multiply all the rows of lats and longs with `1000000` and convert all the data to integers and then after performing calculations I can divide again and get back my lats and longs? It would be much less of a headache! – Asad Feroz Ali Mar 05 '16 at 11:18
  • @digEmAll thanks. It works. So how do I enter a range in the expression you mentioned, for example searching the column `Gowalla$lat` for values between `51.672343` and `51.384940`? Also, can you link me to some documentation which explains the function you used, so I can refer to it for different searches? Thanks a lot. – Asad Feroz Ali Mar 05 '16 at 11:29
  • @AsadFerozAli: `Gowalla$lat <= (51.672343 + tolerance) & Gowalla$lat >= (51.384940 - tolerance)` where tolerance is something like `0.00000001` – digEmAll Mar 05 '16 at 11:42
  • @RHertel: what you said is not completely true. R does have integers (only 32bit) so you can convert a numeric (=double 64bit) to integer using `as.integer` function – digEmAll Mar 05 '16 at 11:44
  • @digEmAll Thank you; I agree. Else it would hardly make sense to add an `L` to integers, as is often done. I will remove the comment. – RHertel Mar 05 '16 at 11:46
  • @RHertel: anyway, it is sadly true that you can only store relatively small values in integers, since unfortunately R does not support 64 bit integers... – digEmAll Mar 05 '16 at 13:03

1 Answer

Try to search the index using

which(sapply(Gowalla$lat, function(x) isTRUE(all.equal(x, 30.23591))))

As explained in the answers to this question, the pitfalls of floating point arithmetic can lead to counterintuitive results. The function all.equal() is designed for such cases: it returns TRUE if the two values are equal within a numerical tolerance. However, when the numbers are not essentially equal it returns a rather verbose description of the difference instead of FALSE, so we need to check explicitly whether the output is TRUE in order to keep only the entries where the equality holds.
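
To make the behaviour of all.equal() concrete, here is a minimal sketch on a small vector. The vector `lat` is a hypothetical stand-in for `Gowalla$lat` (only the first value is taken from this thread), and the `tolerance = 1e-6` value is an arbitrary choice for illustration.

```r
# Hypothetical stand-in for the first few values of Gowalla$lat.
lat <- c(30.2359091167, 30.2691029532, 30.2557309927)

# Exact comparison against the rounded, printed value finds nothing.
exact <- which(lat == 30.23591)

# all.equal() returns TRUE or a character description of the difference,
# so isTRUE() turns its result into a clean logical value.
near <- vapply(lat, function(x) isTRUE(all.equal(x, 30.23591)), logical(1))

# The default tolerance (about 1.5e-8, relative) is still too strict here;
# a looser, explicitly chosen tolerance matches the first value.
near_loose <- vapply(
  lat,
  function(x) isTRUE(all.equal(x, 30.23591, tolerance = 1e-6)),
  logical(1)
)

which(near)        # no match with the default tolerance
which(near_loose)  # the first element matches
```

Note that all.equal() is not vectorised over elements in this sense, which is why it has to be applied value by value; on a large data frame this is likely to be slow.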


As pointed out by @digEmAll, another approach, which seems more promising in this case, consists of introducing a user-defined error margin or tolerance, like:

tol <- 1.e-4

Then we can check whether the value we are looking for is within this margin of error by using

which(abs(Gowalla$lat - 30.23591) < tol)

We need the function abs() here because the magnitude of the difference is important, and not its sign. The larger tol is chosen, the more values are likely to be selected.
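
A small sketch of how the choice of tolerance changes the selection. The values in `lat` are made up for illustration (only the first one appears in this thread), and the three tolerances are arbitrary.

```r
# Hypothetical latitudes at increasing distance from the target 30.23591.
lat <- c(30.2359091167, 30.2359500000, 30.2400000000)

strict <- which(abs(lat - 30.23591) < 1e-8)  # too strict: nothing selected
medium <- which(abs(lat - 30.23591) < 1e-4)  # selects the first two values
loose  <- which(abs(lat - 30.23591) < 1e-2)  # selects all three values
```

Unlike the all.equal() variant, this comparison is vectorised, so it should scale to the full data set without an explicit loop.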


In the example of London mentioned at the end of the OP, one might use two different tol values, one for lon and one for lat:

tol_lat <- 1.01 * (51.672343 - 51.384940) / 2 # half of the latitude range of region of interest, plus 1%
tol_lon <- 1.01 * (0.148271 + 0.351468) / 2 # same for longitudinal values

and define the central values as

lat_c <- (51.672343 + 51.384940) / 2
lon_c <- (0.148271 - 0.351468) /2

Finally, one may check the values in the data frame with

which(abs(Gowalla$lat - lat_c) < tol_lat & abs(Gowalla$long - lon_c) < tol_lon)
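
The same kind of selection can also be written as a direct range comparison, as @nicola suggested in the comments; this uses the box limits as they are, without the extra 1% margin. The following is only a sketch: the data frame `checkins` and its values are invented for illustration, with column names matching those used in the question.

```r
# Hypothetical check-ins: rows 1 and 2 lie inside the London box,
# rows 3 and 4 lie outside it.
checkins <- data.frame(
  uid  = c(1, 2, 3, 4),
  lat  = c(51.5074, 51.6000, 30.2359, 51.5000),
  long = c(-0.1278, 0.1000, -97.7951, 1.2000)
)

# Bounding box limits taken from the question.
lat_min <- 51.384940
lat_max <- 51.672343
lon_min <- -0.351468
lon_max <- 0.148271

# Vectorised logical test: TRUE for rows inside the box.
inside <- checkins$lat  >= lat_min & checkins$lat  <= lat_max &
          checkins$long >= lon_min & checkins$long <= lon_max

# Keep only the matching rows.
london <- checkins[inside, ]
london$uid
```

Range comparisons with `<=` and `>=` do not suffer from the exact-equality problem above, because a point only has to fall between the two limits rather than hit a specific value.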

As a final note, R prints numbers with 7 significant digits by default, which can be close to or beyond the precision of what is being tested. It may therefore be useful to set

options(digits=19)

at the beginning of the script, especially if tol is chosen to be small, near or below 1e-7. This changes only how values are printed, not how they are stored or compared.
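
A short sketch of the difference between the printed and the stored value, using the first latitude reported in the comments below (`30.2359091167`). It assumes the default setting of `digits = 7`; the choice of 12 and 15 digits is arbitrary.

```r
x <- 30.2359091167

# With the default of 7 significant digits the value is shown rounded.
shown_default <- format(x, digits = 7)   # "30.23591"

# Asking for more digits reveals the stored value.
shown_full <- format(x, digits = 12)     # "30.2359091167"

# The comparison uses the stored value, so it fails against the rounded one.
same <- (x == 30.23591)                  # FALSE

# options(digits = ...) changes printing globally; keep the old setting
# so that it can be restored afterwards.
old <- options(digits = 15)
print(x)
options(old)
```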


Thanks to @nicola for pointing out a mistake in a previous version of this answer.

RHertel
  • Thanks for the explanation. Apart from a typo `)` missing in the code, I tried it and it gives me `integer(0)`! So to avoid complexity, can I multiply all the rows of lats and longs with `1000000` and convert all the data to integers and then after performing calculations I can divide again and get back my lats and longs? It would be much less of a headache! – Asad Feroz Ali Mar 05 '16 at 11:21
  • I checked your file and the lat value of the first entry is `30.2359091167`. So there is quite a difference between that number and `30.23591`, and it is normal that `all.equal()` will not return `TRUE` in this case. You can try to use `options(digits=19)` to display the number in more detail; or use the approach described by @digEmAll - introducing a personal accuracy threshold. – RHertel Mar 05 '16 at 11:30
  • Yes you are correct thanks. So could you edit your solution a bit so I can search for the rows which fall in the bounding box for London as mentioned in my query please? Thanks a ton. – Asad Feroz Ali Mar 05 '16 at 11:40
  • I think that you can use the approach described in the comments by @digEmAll. I wouldn't know how to edit my answer without copying his/her contribution. – RHertel Mar 05 '16 at 11:49
  • @RHertel: there's no problem, use my code freely to make your answer complete ;) – digEmAll Mar 05 '16 at 13:04
  • @digEmAll Thanks a lot. I edited the answer and gave you the credits for the solution including the tolerance. – RHertel Mar 05 '16 at 15:20
  • Thanks a lot @RHertel – Asad Feroz Ali Mar 05 '16 at 15:42
  • Hope this helps. I added an example based on the London data. I wish you good luck with your project, and please don't hesitate to ask if you have any difficulty with the code. – RHertel Mar 05 '16 at 15:58