
I have 2 dataframes as follows:

data1 looks like this:

id          address       
1          11123451
2          78947591

data2 looks like the following:

lowerbound_address   upperbound_address    place
78392888                 89000000            X
10000000                 20000000            Y

I want to create another column in data1 called "place" which contains the place the id is from. For example, in the above case, for id 1, I want the place column to contain Y and for id 2, I want the place column to contain X. There will be many ids coming from the same place. And some ids don't have a match.

I am trying to do it using the following piece of code.

places = []
for index, row in data1.iterrows():
    for idx, r in data2.iterrows():
        if r['lowerbound_address'] <= row['address'] <= r['upperbound_address']:
            places.append(r['place'])

The addresses here are float values.

It's taking forever to run this piece of code, which makes me wonder whether my code is correct or whether there's a faster way of doing the same thing.

Any help will be much appreciated. Thank you!

Gingerbread
  • so data2 defines a mapping from ranges of addresses to places, right? data1["address"] is a pandas Series, so maybe you could use [map](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) (or apply if you want to work on a DataFrame). There may be faster methods, but I think this should already improve the speed here – Quickbeam2k1 Dec 23 '16 at 07:56
  • @Quickbeam2k1: Yes, for each id in data1, I want to determine the place that the user is from. data2 contains the lower and upper bound of addresses for every place, and I have to map every id from data1 to a place in data2 by checking whether the address of the id lies between the two bounds given in data2. – Gingerbread Dec 23 '16 at 08:07
  • Will there be multiple matches or no match at all for some address? The current code does not handle that correctly. – YS-L Dec 23 '16 at 08:08
  • There will be many ids coming from the same place. And some ids don't have a match. – Gingerbread Dec 23 '16 at 08:18
  • How many rows are in either data frame? Does the data derive from a database (e.g., SQL Server, Postgres), or do you have access to one such as file-level SQLite? For interval merges, consider a cross join with the filter in the `WHERE` clause (an implicit join; the `FROM`/`WHERE` steps run first in SQL's order of operations). – Parfait Dec 23 '16 at 08:49
  • @Parfait: data1 contains 151112 rows and data2 has 138846 rows. I have the data in csv format and it doesn't derive from any database. – Gingerbread Dec 23 '16 at 09:21
  • Well, you can import the csvs into SQLite (a free, open-source DBMS for which Python maintains a built-in API), then run the cross join/filter query. Using `read_sql`, pandas can import the result; a sketch of this route follows. – Parfait Dec 23 '16 at 09:43
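Following up on that comment, a minimal sketch of the SQLite route, assuming the two csvs live next to the script (the file names here are hypothetical):

import sqlite3
import pandas as pd

# hypothetical csv paths; adjust to the actual files
data1 = pd.read_csv('data1.csv')
data2 = pd.read_csv('data2.csv')

conn = sqlite3.connect(':memory:')  # or a file-level database
data1.to_sql('data1', conn, index=False)
data2.to_sql('data2', conn, index=False)

# implicit cross join, filtered in the WHERE clause
query = """SELECT d1.id, d1.address, d2.place
           FROM data1 d1, data2 d2
           WHERE d1.address BETWEEN d2.lowerbound_address
                                AND d2.upperbound_address"""
result = pd.read_sql(query, conn)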

1 Answer


You can first do a cross join with merge and then filter values by boolean indexing. Finally, remove the unnecessary columns with drop:

data1['tmp'] = 1
data2['tmp'] = 1
df = pd.merge(data1, data2, on='tmp', how='outer')
df = df[(df.lowerbound_address <= df.address) & (df.upperbound_address >= df.address)]
df = df.drop(['lowerbound_address','upperbound_address', 'tmp'], axis=1)
print (df)
   id   address place
1   1  11123451     Y
2   2  78947591     X
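As a side note, newer pandas versions (1.2+) support the cross join directly via `how='cross'`, so the helper `tmp` column is not needed; a sketch assuming such a version is available:

df = pd.merge(data1, data2, how='cross')
df = df[(df.lowerbound_address <= df.address) & (df.upperbound_address >= df.address)]
df = df.drop(['lowerbound_address','upperbound_address'], axis=1)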

Another solution with itertuples; at the end, create the DataFrame with DataFrame.from_records:

places = []
for row1 in data1.itertuples():
    for row2 in data2.itertuples():
        #print (row1.address)
        if (row2.lowerbound_address <= row1.address <= row2.upperbound_address):
            places.append((row1.id, row1.address, row2.place))    
print (places)
[(1, 11123451, 'Y'), (2, 78947591, 'X')]

df = pd.DataFrame.from_records(places)
df.columns=['id','address','place']
print (df)
   id   address place
0   1  11123451     Y
1   2  78947591     X
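Note that `DataFrame.from_records` also accepts a `columns` argument, so the construction and renaming can be collapsed into a single call:

df = pd.DataFrame.from_records(places, columns=['id','address','place'])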

Another solution with apply:

def f(x):
    for row2 in data2.itertuples():
        if (row2.lowerbound_address <= x <= row2.upperbound_address):
            return pd.Series([x, row2.place], index=['address','place'])

df = data1.set_index('id')['address'].apply(f).reset_index()
print (df)
   id   address place
0   1  11123451     Y
1   2  78947591     X

EDIT:

Timings:

N = 1000:

If some values are not in range, they are omitted in solutions b and c. Check the last row of df1.

In [73]: %timeit (data1.set_index('id')['address'].apply(f).reset_index())
1 loop, best of 3: 2.06 s per loop

In [74]: %timeit (a(df1a, df2a))
1 loop, best of 3: 82.2 ms per loop

In [75]: %timeit (b(df1b, df2b))
1 loop, best of 3: 3.17 s per loop

In [76]: %timeit (c(df1c, df2c))
100 loops, best of 3: 2.71 ms per loop

Code for timings:

np.random.seed(123)
N = 1000
data1 = pd.DataFrame({'id':np.arange(1,N+1), 
                   'address': np.random.randint(N*10, size=N)}, columns=['id','address'])

#add last row with value out of range
data1.loc[data1.index[-1]+1, ['id','address']] = [data1.index[-1]+1, -1]
data1 = data1.astype(int)
print (data1.tail())

data2 = pd.DataFrame({'lowerbound_address':np.arange(1, N*10,10), 
                      'upperbound_address':np.arange(10,N*10+10, 10),
                      'place': np.random.randint(40, size=N)})

print (data2.tail())
df1a, df1b, df1c = data1.copy(),data1.copy(),data1.copy()
df2a, df2b ,df2c = data2.copy(),data2.copy(),data2.copy()

def a(data1, data2):
    data1['tmp'] = 1
    data2['tmp'] = 1
    df = pd.merge(data1, data2, on='tmp', how='outer')
    df = df[(df.lowerbound_address <= df.address) & (df.upperbound_address >= df.address)]
    df = df.drop(['lowerbound_address','upperbound_address', 'tmp'], axis=1)
    return (df)

def b(data1, data2):
    places = []
    for row1 in data1.itertuples():
        for row2 in data2.itertuples():
            if (row2.lowerbound_address <= row1.address <= row2.upperbound_address):
                places.append((row1.id, row1.address, row2.place))

    # build the DataFrame once, after both loops finish
    df = pd.DataFrame.from_records(places)
    df.columns = ['id','address','place']
    return df

def f(x):
    # use for ... else to return NaN for values out of range
    # http://stackoverflow.com/q/9979970/2901002
    for row2 in data2.itertuples():
        if (row2.lowerbound_address <= x <= row2.upperbound_address):
            return pd.Series([x, row2.place], index=['address','place'])
    else:
        return pd.Series([x, np.nan], index=['address','place'])

def c(data1,data2):
    data1 = data1.sort_values('address')
    data2 = data2.sort_values('lowerbound_address')
    df = pd.merge_asof(data1, data2, left_on='address', right_on='lowerbound_address')
    df = df.drop(['lowerbound_address','upperbound_address'], axis=1)
    return df.sort_values('id')


print (data1.set_index('id')['address'].apply(f).reset_index())
print (a(df1a, df2a))
print (b(df1b, df2b))
print (c(df1c, df2c))

Only solution c with merge_asof works nicely with a large DataFrame:

N=1M:

In [84]: %timeit (c(df1c, df2c))
1 loop, best of 3: 525 ms per loop

More about [merge_asof](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.merge_asof.html) in the docs.
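One caveat: `merge_asof` matches each address only against the nearest `lowerbound_address`, so an address lying past its matched interval's upper bound would still receive a place. The contiguous ranges generated above avoid this, but for real data with gaps, a sketch of an extra guard might be:

import numpy as np

df = pd.merge_asof(data1.sort_values('address'),
                   data2.sort_values('lowerbound_address'),
                   left_on='address', right_on='lowerbound_address')
# blank out matches where the address falls past the interval's upper bound
df.loc[df.address > df.upperbound_address, 'place'] = np.nan
df = df.drop(['lowerbound_address','upperbound_address'], axis=1)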

jezrael
  • Thank you so much! I am not in a hurry!! – Gingerbread Dec 23 '16 at 08:24
  • Unfortunately you need a loop solution; `itertuples` is a bit better than `iterrows`. But with large dataframes, all solutions are slow, unfortunately. – jezrael Dec 23 '16 at 08:32
  • I think the first solution is correct, but I am not sure appending only the place to the list is enough - the data can get shifted, so you need to append the id and address to the list too. But I think the apply solution can be faster; I can create some timings. – jezrael Dec 23 '16 at 08:39
  • Ohh yes makes sense. I am trying out the apply solution now! Thank you for all the three solutions!! – Gingerbread Dec 23 '16 at 08:42
  • The apply method threw an error: TypeError: object of type 'NoneType' has no len(). Do you know why? :| – Gingerbread Dec 23 '16 at 19:35
  • Yes, the problem is that some value is not in range. See the edit. – jezrael Dec 24 '16 at 11:16