
I have two dataframes:
df_transactions has the column ip_address.
df_ip has the columns lower_bound_ip_address, upper_bound_ip_address and country.

I'm joining these dataframes to get the country for each IP address, by checking whether df_transactions.ip_address lies between df_ip.lower_bound_ip_address and df_ip.upper_bound_ip_address (both bounds inclusive).
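
To make the intended result concrete, here is a tiny illustration of the join condition. The sample values and the helper name `country_for` are made up for this example and are not taken from the real data; the helper is a slow row-by-row reference, only meant to show what a correct match looks like.

```python
import pandas as pd

# Hypothetical sample values, only to show the shape of the two dataframes.
df_transactions = pd.DataFrame({"ip_address": [15.0, 120.0, 999.0]})
df_ip = pd.DataFrame({
    "lower_bound_ip_address": [0, 100, 200],
    "upper_bound_ip_address": [99, 199, 299],
    "country": ["A", "B", "C"],
})


def country_for(ip, ranges):
    """Return the country whose inclusive range contains ip, or None."""
    hit = ranges[
        (ranges["lower_bound_ip_address"] <= ip)
        & (ranges["upper_bound_ip_address"] >= ip)
    ]
    return hit["country"].iloc[0] if len(hit) else None


# One lookup per transaction; 999.0 falls outside every range.
expected = [country_for(ip, df_ip) for ip in df_transactions["ip_address"]]
```

With these sample values, the first two addresses fall in the ranges for "A" and "B", and the third has no match.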

The code below works on a subset of the data:

a = df_transactions.ip_address.values
bh = df_ip.upper_bound_ip_address.values
bl = df_ip.lower_bound_ip_address.values

i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

pd.DataFrame(
    np.column_stack([df_transactions.values[i], df_ip.values[j]]),
    columns=df_transactions.columns.append(df_ip.columns)
)

But when I run it on the entire dataframe, I get the following error.
Is there a better way to do this?

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_23472/1585362522.py in <module>
      3 bl = df_ip.lower_bound_ip_address.values
      4 
----> 5 i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))
      6 
      7 pd.DataFrame(

MemoryError: Unable to allocate 15.5 GiB for an array with shape (120000, 138846) and data type bool
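
The size in the error matches the full broadcast comparison: 120000 × 138846 boolean values at one byte each is 16,661,520,000 bytes, about 15.5 GiB. One possible lower-memory direction is sketched below. It rests on an assumption not stated above, namely that the ranges in df_ip do not overlap, so each address matches at most one row. The function name `lookup_country` is made up for this sketch. It uses `np.searchsorted` on the sorted lower bounds instead of building the full matrix, and it assumes df_ip is not empty.

```python
import numpy as np
import pandas as pd


def lookup_country(df_transactions, df_ip):
    """Attach a country column using binary search instead of broadcasting.

    Assumption: the ranges in df_ip do not overlap.
    """
    ranges = df_ip.sort_values("lower_bound_ip_address").reset_index(drop=True)
    lows = ranges["lower_bound_ip_address"].to_numpy()
    highs = ranges["upper_bound_ip_address"].to_numpy()
    ips = df_transactions["ip_address"].to_numpy()

    # Index of the last range whose lower bound is <= ip (-1 if there is none).
    idx = np.searchsorted(lows, ips, side="right") - 1
    safe = idx.clip(min=0)  # avoid negative indexing for addresses below all ranges

    # A candidate range only matches if ip is also <= its upper bound.
    matched = (idx >= 0) & (ips <= highs[safe])

    country = ranges["country"].to_numpy(dtype=object)[safe]
    country[~matched] = None  # no containing range found

    out = df_transactions.copy()
    out["country"] = country
    return out
```

This needs memory proportional to the number of transactions plus the number of ranges, rather than their product. If the ranges can overlap, this sketch returns only one of the matching rows, so it would not reproduce the original join exactly.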

Thanks!
