
I posted a question along the same lines yesterday; this is a slightly modified version of it (previous question here).

I have two dataframes, as follows.

data1 looks like this:

id          address       
1          11123451
2          78947591

data2 looks like the following:

lowerbound_address   upperbound_address    place
78392888                 89000000            X
10000000                 20000000            Y

I want to create another column in data1 called "place" that contains the place each id is from. Many ids come from the same place, and some ids have no match at all.

The addresses here are float values.

What I am looking for is a Python equivalent of the following R code. This is easy to write in R, but I am unsure how to do it in Python. Can someone help me with this?

data_place = rep(NA, nrow(data1))
for (i in 1:nrow(data1)) {
  tmp = as.character(data2[data1$address[i] >= data2$lowerbound_address &
                           data1$address[i] <= data2$upperbound_address, "place"])
  if (length(tmp) == 1) {data_place[i] = tmp}
}

data1$place = data_place
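For reference, a direct pandas translation of this R loop might look like the sketch below (the dataframes here just mirror the sample data above; this is one possible approach, not the only one):

```python
import numpy as np
import pandas as pd

# Dataframes mirroring the sample data in the question.
data1 = pd.DataFrame({'id': [1, 2], 'address': [11123451.0, 78947591.0]})
data2 = pd.DataFrame({'lowerbound_address': [78392888.0, 10000000.0],
                      'upperbound_address': [89000000.0, 20000000.0],
                      'place': ['X', 'Y']})

# Like the R loop: for each address, find the interval(s) it falls
# inside and keep the place only when exactly one interval matches.
places = []
for addr in data1['address']:
    match = data2.loc[(data2['lowerbound_address'] <= addr) &
                      (addr <= data2['upperbound_address']), 'place']
    places.append(match.iloc[0] if len(match) == 1 else np.nan)

data1['place'] = places
```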
Gingerbread
  • I may have misunderstood you, but it seems all the work is being done in `data1`, if so, what is the purpose of `data2`? – Leb Dec 24 '16 at 03:03
  • @Leb: I want to map each id in data1 to the place in data2 by checking if the address of the id in data1 falls within the bounds (lower and upper bound addresses) in data2. If it does, then my new column "place" in data1 should contain the place. – Gingerbread Dec 24 '16 at 03:08

2 Answers


Something like this would work.

import pandas as pd
import numpy as np

# The below section is only used to import data

from io import StringIO

data = """      
id          address       
1          11123451
2          78947591
3          50000000
"""

data2 = """
lowerbound_address   upperbound_address    place
78392888                 89000000            X
10000000                 20000000            Y
"""

# The above section is only used to import data

df = pd.read_csv(StringIO(data), delimiter=r'\s+')
df2 = pd.read_csv(StringIO(data2), delimiter=r'\s+')

df['new'] = np.nan

df.loc[(df['address'] >= df2['lowerbound_address'][0]) & (df['address'] <= df2['upperbound_address'][0]), 'new'] = 'X'
df.loc[(df['address'] >= df2['lowerbound_address'][1]) & (df['address'] <= df2['upperbound_address'][1]), 'new'] = 'Y'

In addition to pandas, we used numpy for np.nan.

All I did was create a new column, assign NaN to it, and then apply two criteria that assign either 'X' or 'Y' based on the upper and lower bounds in the second dataframe (the last two lines).

Final results:

   id   address  new
0   1  11123451    Y
1   2  78947591    X
2   3  50000000  NaN
Leb
  • How do I do it when I have a long list of places to be mapped? It's not just X and Y. It's some 3000 different places that need to be mapped to the ids based on their ip address. – Gingerbread Dec 24 '16 at 04:02
  • Sorry for the late response regarding your first comment, you can create a dictionary from `data2` using `df2.to_dict()` then loop through where the values are `lowerbound_address` and `upperbound_address` and the key as your assigned country value (the default method for `pandas`). – Leb Dec 24 '16 at 12:54
  • I simply used your code in a for loop and it worked :) – Gingerbread Jan 08 '17 at 18:29
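Per the comments, the hard-coded 'X'/'Y' lines generalize to any number of places by looping over the rows of the bounds dataframe. A sketch of that loop (the dataframes here mirror the sample data, with an unmatched third id):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'address': [11123451, 78947591, 50000000]})
df2 = pd.DataFrame({'lowerbound_address': [78392888, 10000000],
                    'upperbound_address': [89000000, 20000000],
                    'place': ['X', 'Y']})

# Object column so string labels can be assigned; unmatched rows stay missing.
df['new'] = None

# Loop over every interval in df2 instead of hard-coding 'X' and 'Y'.
for _, row in df2.iterrows():
    mask = (df['address'] >= row['lowerbound_address']) & \
           (df['address'] <= row['upperbound_address'])
    df.loc[mask, 'new'] = row['place']
```

With 3000 places this still loops once per interval, which is fine for moderate sizes; the merge_asof approach below the other answer scales better for very large frames.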

Do a merge_asof and then replace the place with NaN wherever the address falls outside the bounds.

import pandas as pd

data1.sort_values('address', inplace=True)
data2.sort_values('lowerbound_address', inplace=True)
data3 = pd.merge_asof(data1, data2, left_on='address', right_on='lowerbound_address')
data3['place'] = data3['place'].where(data3.address <= data3.upperbound_address)
data3 = data3.drop(['lowerbound_address', 'upperbound_address'], axis=1)

Output

   id   address place
0   1  11123451     Y
1   3  50000000   NaN
2   2  78947591     X
Ted Petrou
  • Does the merge_asof come with a newer version of pandas or something? Because it throws this error: AttributeError: 'module' object has no attribute 'merge_asof'. Do you happen to know why? – Gingerbread Dec 24 '16 at 03:15
  • Yes it's new as of the last release 0.19 – Ted Petrou Dec 24 '16 at 03:16
  • Make sure pd.__version__ is 0.19. Check the docs http://pandas.pydata.org/pandas-docs/version/0.19.0/generated/pandas.merge_asof.html – Ted Petrou Dec 24 '16 at 03:33
  • Okay, it's working now. However, this code after running gives NaNs in the place as well as the lower and upper bound columns. I will post a screenshot of my dataframe in the question. – Gingerbread Dec 24 '16 at 03:46
  • You can drop extra columns. merge_asof is going to be orders of magnitudes faster than the accepted solution on larger dataframes. I updated my answer and show the output of the final dataframe – Ted Petrou Dec 24 '16 at 05:32