2

I have two tables in pandas:

df1: Containing User IDs and IP_Addresses for 150K users.

|---------------|---------------|  
|    User_ID    |   IP_Address  |
|---------------|---------------|  
|      U1       |   732758368.8 |
|      U2       |   350311387.9 |
|      U3       |   2621473820  |
|---------------|---------------|

df2: Containing IP address ranges and the country each range belongs to, 139K records

|---------------|-----------------|------------------|  
|    Country    | Lower_Bound_IP  |  Upper_Bound_IP  |
|---------------|-----------------|------------------|  
|   Australia   |   1023787008    |    1023791103    |
|   USA         |   3638734848    |    3638738943    |
|   Australia   |   3224798976    |    3224799231    |
|   Poland      |   1539721728    |    1539721983    |
|---------------|-----------------|------------------|

My objective is to create a Country column in df1 by finding, for each IP_Address, the row of df2 whose Lower_Bound_IP to Upper_Bound_IP range contains it.

|---------------|---------------|---------------|   
|    User_ID    |   IP_Address  |    Country    |
|---------------|---------------|---------------|   
|      U1       |   732758368.8 |   Indonesia   |
|      U2       |   350311387.9 |   Australia   |
|      U3       |   2621473820  |   Albania     |
|---------------|---------------|---------------|

My first approach was to do a cross join (cartesian product) of the two tables and then filter down to the matching records. However, a cross join using pandas.merge() is not feasible here, since it would create roughly 21 billion intermediate records (150K × 139K). The code crashes every time. Could you please suggest an alternative solution that is feasible?

  • Are the IP_Address ranges comprehensive? i.e., are there IP_Address values in `df1` for which you'd expect `Country` to be null? – cmaher Apr 14 '18 at 00:38
  • @cmaher I will assume for now that the ranges are comprehensive so that there will be no null country for any user. – DS_Enthusiast Apr 14 '18 at 01:08

1 Answer

2

I'm not sure how to do this with pandas.where, but with numpy.where you can do:

import numpy

# Use .values so the comparison runs on plain NumPy arrays (a pandas
# Series does not support [:, None] indexing) and so the final
# assignment ignores df2's index.
idx = numpy.where((df1.IP_Address.values[:, None] >= df2.Lower_Bound_IP.values[None, :])
    & (df1.IP_Address.values[:, None] <= df2.Upper_Bound_IP.values[None, :]))[1]
df1["Country"] = df2.Country.values[idx]

numpy.where gives the indices where the given condition is True. & is element-wise 'and', and the [:, None] bit adds a dummy axis where None is located, so the two arrays broadcast against each other: every IP_Address is compared against every range in df2. Taking element [1] of the result gives, for each User_ID, the index of the row in df2 whose range contains its IP_Address. This will break down if the ranges in df2 overlap (an IP would match more than one row), or if some IP_Address falls outside every range.
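The broadcasting trick above can be seen on a toy example (the numbers here are made up purely for illustration):

```python
import numpy as np

ips = np.array([15, 25])        # two "IP addresses"
lower = np.array([10, 20, 30])  # three range lower bounds
upper = np.array([19, 29, 39])  # three range upper bounds

# ips[:, None] has shape (2, 1); lower[None, :] has shape (1, 3).
# Comparing them broadcasts to a (2, 3) boolean matrix:
# one row per IP, one column per range.
mask = (ips[:, None] >= lower[None, :]) & (ips[:, None] <= upper[None, :])

# np.where on the matrix returns (row_indices, col_indices);
# the column indices tell you which range each IP fell into.
rows, cols = np.where(mask)
print(cols)  # -> [0 1]: 15 falls in range 0 (10-19), 25 in range 1 (20-29)
```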

This might still cause memory issues (each batch materialises a batch_size × len(df2) boolean matrix), but you can add a loop so the comparison is done in batches. E.g.:

import numpy

batch_size = 1000
n_batches = df1.shape[0] // batch_size
# Integer division rounds down, so if the number
# of User_IDs is not divisible by batch_size,
# we need one extra batch for the remainder
if n_batches * batch_size < df1.shape[0]:
    n_batches += 1
indices = []
for i in range(n_batches):
    batch = df1.IP_Address.values[i * batch_size:(i + 1) * batch_size, None]
    idx = numpy.where((batch >= df2.Lower_Bound_IP.values[None, :])
            & (batch <= df2.Upper_Bound_IP.values[None, :]))[1]
    indices.extend(idx.tolist())

df1["Country"] = df2.Country.values[numpy.asarray(indices)]
user2653663
  • It works like charm. Thank you so much. The batch calculation is really helpful in my case. Substantially eases load on the memory. – DS_Enthusiast Apr 14 '18 at 22:32