2

In R, I know there are many different ways of joining/merging data frames based on an equals-condition between two or several columns.

However, I need to join two data frames based on matching a value to a value-range, defined by 2 columns, using greater-than-or-equal-to in one case and less-than-or-equal-to in the other. If I was using SQL, the query could be:

SELECT * FROM Table1,
LEFT JOIN Table2
ON Table1.Value >= Table2.LowLimit AND Table1.Value <= Table2.HighLimit

I know about the sqldf package, but I would like to avoid using that if possible.

The data I am working with is one data frame with ip-addresses, like so:

ipaddresses <- data.frame(IPAddress=c("1.1.1.1","2.2.2.2","3.3.3.3","4.4.4.4"))

The other data frame is the MaxMind geolite2 database, containing an ip-address range start, and ip-address range end, and a geographic location ID:

ip_range_start <- c("1.1.1.0","3.3.3.0")
ip_range_end <- c("1.1.1.255","3.3.3.100")
geolocationid <- c("12345","67890")
ipranges <- data.frame(ip_range_start,ip_range_end,geolocationid)

So, what I need to achieve is a join of ipranges$geolocationid onto ipaddresses, in each case where

ipaddresses$IPAddress >= ipranges$ip_range_start 
AND 
ipaddresses$IPAddress <= ipranges$ip_range_end

With the example data above, that means I need to correctly find that 1.1.1.1 is in the range of 1.1.1.0-1.1.1.255, and 3.3.3.3 is in the range of 3.3.3.0-3.3.3.100.

stenevang
  • 116
  • 7

3 Answers3

0

This approach may not scale well, because it involves initially doing an outer join via broom::inflate(), but it should work if you don't have a ton of ipaddresses:

library(dplyr)
library(broom)

ipranges %>%
  inflate(ipaddresses) %>%
  ungroup %>%
  filter(
    numeric_version(IPAddress) >=  numeric_version(ip_range_start),
    numeric_version(IPAddress) <= numeric_version(ip_range_end)
  )

Results

Source: local data frame [2 x 4]

  IPAddress ip_range_start ip_range_end geolocationid
     (fctr)         (fctr)       (fctr)        (fctr)
1   1.1.1.1        1.1.1.0    1.1.1.255         12345
2   3.3.3.3        3.3.3.0    3.3.3.100         67890
davechilders
  • 8,693
  • 2
  • 18
  • 18
0

Having done some additional research, I have actually found a solution for my particular use case. Still, it is NOT a solution to the general problem: How to join two data frames where the join condition is that key >= value1 AND key <= value2. However, it does solve the actual problem I had.

What I ended up finding as a great way to solve my need for geographic location of ip-addresses, is the package rgeolocate in combination with the downloadable binary version of the MaxMind GeoLite2 database.

The solution is lightning-fast; the matching of 500+ ip-addresses to 3+ million ip-ranges is done in a second. My previous attempt involved loaded the CSV-version of the MaxMind database into a data frame and work from there. Don't do that. Thanks to the rgeolocate package and the binary MaxMind database, it is SO much faster.

My code ended up being this (dataunion is the name of my data frame where I have my collected ip-addresses)

library(rgeolocate)

ipaddresslist <- as.character(dataunion$IPAddress)
geoloc <- maxmind(ipaddresslist, "GeoLite2-City.mmdb", c("latitude","longitude", "continent_name","country_name","region_name","city_name"))
colnames(geoloc) <- c("Lat","Long","Continent","Country","Region","City")
dataunion <- cbind(dataunion, geoloc)
stenevang
  • 116
  • 7
0

Finally, I have found the solution for the general problem, in addition to the above solution to the specific problem of geolocating IP-addresses using the MaxMind database.

This is the general solution for joining two data frames of equal or unequal length, where a value must be compared with an inequality condition (less-than or greater-than) to one or more columns.

The solution is using sapply, which is base R.

With the two data frames defined in the question, iprangesand ipaddresses, we have:

ipaddresses$geolocationid <- sapply(ipaddresses$IPAddress, 
    function(x) 
    ipranges$geolocationid[ipranges$ip_range_start <= x & ipranges$ip_range_end >= x])

What sapply does is it takes each element, one at a time, from the vector ipaddresses$IPAddressand applies it to the function expression provided as an argument to sapply. The result element of applying the function to each element is appended to a vector, which is the output result of sapply. And that is what we insert as a new column into ipaddresses$geolocationid.

In this case, if the IP-addresses are converted to integers first, the sapply operation probably gets faster. Here are a few lines that will extend the ipaddresses data frame with a column containing the integer version of each ip-address:

#calculating the integer version of each IP-address
octet <- data.frame(read.table(text=as.character(ipaddresses$IPAddress), sep="."))
octet$IPint <- 256^3*octet[,1] + 256^2*octet[,2] + 256*octet[,3] + octet[,4]
ipaddresses$IPint <- octet$IPint
# cleaning "octet" from memory
octet <- NULL

You would obviously have to do the same kind of conversion to the IP-addresses in your ipranges dataframe.

stenevang
  • 116
  • 7