I've got two dataframes:
Dataframe df_transactions has the column ip_address.
Dataframe df_ip has the columns lower_bound_ip_address, upper_bound_ip_address and country.
I'm joining these dataframes to get the country for each IP address, by checking whether df_transactions.ip_address is between df_ip.lower_bound_ip_address and df_ip.upper_bound_ip_address.
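To make the intended result concrete, here is a minimal sketch on made-up data. The values below are hypothetical (my real IP addresses and ranges are different and far more numerous), and the cross join is only there to show the output I'm after, not something that would scale:

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataframes.
df_transactions = pd.DataFrame({"ip_address": [15, 120, 250]})
df_ip = pd.DataFrame({
    "lower_bound_ip_address": [0, 100, 200],
    "upper_bound_ip_address": [99, 199, 299],
    "country": ["A", "B", "C"],
})

# Pair every transaction with every range (requires pandas >= 1.2 for how="cross"),
# then keep only the pairs where the address lies inside the range, bounds included.
pairs = df_transactions.merge(df_ip, how="cross")
in_range = pairs["ip_address"].between(
    pairs["lower_bound_ip_address"], pairs["upper_bound_ip_address"]
)
expected = pairs[in_range].reset_index(drop=True)
print(expected)
```

Each transaction row should end up with the columns of the one range that contains its address, including country.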
The code below works on a subset of the data:
a = df_transactions.ip_address.values
bh = df_ip.upper_bound_ip_address.values
bl = df_ip.lower_bound_ip_address.values

i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

pd.DataFrame(
    np.column_stack([df_transactions.values[i], df_ip.values[j]]),
    columns=df_transactions.columns.append(df_ip.columns)
)
But when I run it on the entire dataframes, I get the error below.
Is there a more memory-efficient way to do this join?
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_23472/1585362522.py in <module>
      3 bl = df_ip.lower_bound_ip_address.values
      4
----> 5 i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))
      6
      7 pd.DataFrame(

MemoryError: Unable to allocate 15.5 GiB for an array with shape (120000, 138846) and data type bool
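As far as I can tell, the size in the error is just the full comparison matrix: one boolean per (transaction, range) pair, at one byte each. A quick check using only the shape from the traceback:

```python
# Shape reported in the MemoryError: (number of transactions, number of IP ranges).
n_transactions, n_ranges = 120_000, 138_846

# NumPy stores each bool element in one byte.
n_bytes = n_transactions * n_ranges * 1
gib = n_bytes / 2**30
print(f"{gib:.1f} GiB")  # prints "15.5 GiB"
```

So each of the two comparisons in the np.where call needs an array of this size, which is why the approach stops working once both dataframes are large.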
Thanks!