4

I am very new to R and searched about this on forums but couldn't get a close enough solution for this. I am trying to do a mapping between the ip address & corresponding geo locations. I have 2 data sets.

Set-a (1,60,000 rows):
ip(int) | ID(int)

Set-b (16,00,000 rows):
Ip1(int) | Ip2(int) | Code(str) | Country(str) | Area1(str) | Area2(str)

I am trying to do the following: if ip lies between Ip1 & Ip2 then add Country & Region to Set-a.

I am doing the following (obviously not a very good way to do this):

ip1<-as.numeric(b$Ip1)
ip2<-as.numeric(b$Ip2)
country<-b$Country
area1<-b$Area1
area2<-b$Area2

for(i in 1:160000){
  for(j in 1:1674303){
    if(a[i]>ip1[j] & a[i]<ip2[j]) {
                                   a$country[i]<-country[j]
                                   a$area1[i]<-area1[j]
                                   a$area2[i]<-area2[j]}
   }
}

Can someone please tell me an efficient way to do this. This is taking a lot of time. (for i=1 to 100 took some 10 mins to run).

The sample data set-b is:

Ip1, Ip2, Code, Country, Area1, Area2
"0","16777215","-","-","-","-"
"16777216","16777471","AU","AUSTRALIA","QUEENSLAND","SOUTH BRISBANE"
"16777472","16778239","CN","CHINA","FUJIAN","FUZHOU"
"16778240","16778495","AU","AUSTRALIA","VICTORIA","MELBOURNE"
"16778496","16778751","AU","AUSTRALIA","NEW SOUTH WALES","SYDNEY"

It is in continuos increasing order.

The dput(head(a)) & dput(head(b)) respectively are: (refer sample data above)

structure(IP_Addr = c("38825563", "38921619", "42470287", "42471923","42473368","42473428"), 
 Desc_value = c("0", "1.2", "4.97", "1", "5.9", "22.06")), .Names = c("IP_Addr", "Desc_value"), row.names = c(NA, 6L), class = "data.frame")

structure(list(Ip1 = c("0", "16777216", "16777472", "16778240", 
"16778496", "16778752"), Ip2 = c("16777215", "16777471", "16778239", 
"16778495", "16778751", "16779263"), Code = c("-", "AU", "CN", 
"AU", "AU", "AU"), Country = c("-", "AUSTRALIA", "CHINA", "AUSTRALIA", 
"AUSTRALIA", "AUSTRALIA"), Area1 = c("-", "QUEENSLAND", "FUJIAN", 
"VICTORIA", "NEW SOUTH WALES", "-"), Area2 = c("-", "SOUTH BRISBANE", 
"FUZHOU", "MELBOURNE", "SYDNEY", "-")), .Names = c("Ip1", "Ip2", 
"Code", "Country", "Area1", "Area2"), row.names = c(NA, 6L), class = "data.frame")
  • Maybe you could provide a small example data set? – Frank May 20 '13 at 13:08
  • 2
    This will be easier if the `Ip1` and `Ip2` ranges are different for every country/region in `Set-b`, so can you tell us if that's the case? And even if not, I bet if you first sort your sets so the IP values are in order, it'll lead to a simpler "screening" algorithm. – Carl Witthoft May 20 '13 at 13:20
  • @Frank, Carl : Yes, the ranges are different & continuos increasing order. Sample data set is Ip1, Ip2, Code, Country, Area1, Area2 "16777216","16777471","AU","AUSTRALIA","QUEENSLAND","SOUTH BRISBANE" "16777472","16778239","CN","CHINA","FUJIAN","FUZHOU" "16778240","16778495","AU","AUSTRALIA","VICTORIA","MELBOURNE" – Gaurav Nemade May 20 '13 at 14:35
  • Please, a sample data set _that we can read in_. :) Also, it would be better to edit your question to add the information. I forgot to mention: you can use `dput(a)` and `dput(b)`... – Frank May 20 '13 at 14:42
  • @Carl, Frank, I am sorry for this. The formatting gets screwed here. I have added the sample data set in the question. – Gaurav Nemade May 20 '13 at 14:46
  • 2
    Welcome to Stack Overflow! As you can see, we're very confused about the state of your data. Please create a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). As Frank mentioned, pasting the output of `dput(head(a))` and `dput(head(b))` would be very useful. – Blue Magister May 20 '13 at 14:51
  • Updated the dput(head(a/b)) in the questions. Also included sample data set-b, if needed. – Gaurav Nemade May 20 '13 at 16:12

3 Answers3

4

Here's a data.table solution:

# Let's take Blue Magister's example set:
set.seed(10)
a <- data.frame(ip=sample(16777216:16778751,10,replace=TRUE))
b <- read.table(sep=",",header=TRUE,text='Ip1, Ip2, Code, Country, Area1, Area2
"0","16777215","-","-","-","-"
"16777216","16777471","AU","AUSTRALIA","QUEENSLAND","SOUTH BRISBANE"
"16777472","16778239","CN","CHINA","FUJIAN","FUZHOU"
"16778240","16778495","AU","AUSTRALIA","VICTORIA","MELBOURNE"
"16778496","16778751","AU","AUSTRALIA","NEW SOUTH WALES","SYDNEY"')

b$Ip1 <-as.numeric(b$Ip1)

# include library, convert to data.table
library(data.table)

a = data.table(a)
b = data.table(b, key = "Ip1")

# and now the actual computation
a = b[a, roll = Inf][, Ip2 := NULL] # yep, amazingly, it's *that* simple in data.table
setnames(a, "Ip1", "ip")            # you can also include, exclude whatever columns you want
a
#          ip Code   Country      Area1          Area2
# 1: 16777995   CN     CHINA     FUJIAN         FUZHOU
# 2: 16777687   CN     CHINA     FUJIAN         FUZHOU
# 3: 16777871   CN     CHINA     FUJIAN         FUZHOU
# 4: 16778280   AU AUSTRALIA   VICTORIA      MELBOURNE
# 5: 16777346   AU AUSTRALIA QUEENSLAND SOUTH BRISBANE
# 6: 16777562   CN     CHINA     FUJIAN         FUZHOU
# 7: 16777637   CN     CHINA     FUJIAN         FUZHOU
# 8: 16777634   CN     CHINA     FUJIAN         FUZHOU
# 9: 16778161   CN     CHINA     FUJIAN         FUZHOU
#10: 16777875   CN     CHINA     FUJIAN         FUZHOU

Had Ip1 been an exhaustive list of numbers that ip could match, then above would simply be a merge (of Ip1 in b with first column of a, i.e. ip), but data.table also provides an option of what to do when there is no exact match. You can tell it to e.g. roll the previous observation forward (which is what I did above), or roll it back or roll to the nearest observation - see ?data.table for a little more information.

eddi
  • 49,088
  • 6
  • 104
  • 155
3

Couldn't you remove the 2nd loop using,

j = intersect(which(ip1 < x[i]), which(ip2 > x[i]))
if  (length(j)==1){
         a$country[i]<-country[j]
         a$area1[i]<-area1[j]
         a$area2[i]<-area2[j]
}else{
         cat("Multiple matches found!\n")  
}
harkmug
  • 2,725
  • 22
  • 26
  • Thanks for the help. Didn't know about intersect & which. This makes the process faster but it is still taking a long time (around 120 iterations/min - 22 hours for entire process). Is there some a faster way than this? – Gaurav Nemade May 20 '13 at 14:33
  • @GauravNemade - was `data.table` of `findInterval` faster? "Accept" the answer that worked best for you. – harkmug May 20 '13 at 20:30
3

I would try findInterval:

#create example
set.seed(10)
a <- data.frame(ip=sample(16777216:16778751,10,replace=TRUE))
b <- read.table(sep=",",header=TRUE,text='Ip1, Ip2, Code, Country, Area1, Area2
"0","16777215","-","-","-","-"
"16777216","16777471","AU","AUSTRALIA","QUEENSLAND","SOUTH BRISBANE"
"16777472","16778239","CN","CHINA","FUJIAN","FUZHOU"
"16778240","16778495","AU","AUSTRALIA","VICTORIA","MELBOURNE"
"16778496","16778751","AU","AUSTRALIA","NEW SOUTH WALES","SYDNEY"')

b$Ip1 <-as.numeric(b$Ip1)
indices <- findInterval(a$ip,b$Ip1,rightmost.closed=FALSE,all.inside=FALSE)
a <- data.frame(a,b[indices,c("Country","Area1","Area2")])
Blue Magister
  • 13,044
  • 5
  • 38
  • 56