
I am trying to assign a value to a column for all rows selected based on a condition. Solutions for this are discussed in several questions, like this one. The standard solution uses the following syntax:

df.loc[row_mask, cols] = assigned_val

Unfortunately, this standard solution takes forever. In fact, in my case, I didn't manage to get even one assignment to complete.

Update: More info about my dataframe: I have ~2 million rows, and I am trying to update the value of one column for rows that are selected based on a condition. On average, the selection condition is satisfied by ~10 rows.

Is it possible to speed up this assignment operation? Also, are there any general guidelines for doing many such assignments in pandas?
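For reference, a minimal sketch of the setup described above (the column names and the equality condition are made up for illustration):

import numpy as np
import pandas as pd

# ~2 million rows; 'key' is an assumed column used for the selection condition
df = pd.DataFrame({
    'key': np.random.randint(0, 200_000, size=2_000_000),
    'val': np.zeros(2_000_000),
})

# with 200,000 distinct keys, a given key matches ~10 rows on average
row_mask = df['key'] == 42
df.loc[row_mask, 'val'] = 1.0  # this is the assignment that is too slow for me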

Arul
  • please explain a bit more about your use case, the size of the dataframe, etc., to help speed it up if possible, because `loc` is the standard way to access several rows at once and in most cases it is fast enough. – Ben.T Oct 13 '21 at 15:41

2 Answers


I believe the difference between .loc and .at is what you're looking for: .at is meant to be faster, based on this answer.
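For what it's worth, .at addresses a single cell at a time, so a mask-based update has to loop over the matching row labels. A minimal sketch, using a small toy frame purely for illustration:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

# .at takes one (row label, column label) pair, so iterate over the rows
# that satisfy the condition and assign the scalar one cell at a time
for idx in df[df['B'] < 50].index:
    df.at[idx, 'B'] = 100000

Whether the per-cell speed of .at beats a single vectorized .loc assignment depends on how many rows match; with only ~10 matching rows it may be worth trying.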

DataPlug

You could give np.where a try.

Here is a simple example of np.where:

import pandas as pd
import numpy as np

# 100 rows of random integers in [0, 100)
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
# wherever B < 50, replace the value with 100000; otherwise keep the original
df['B'] = np.where(df['B'] < 50, 100000, df['B'])

The question "np.where() do nothing if condition fails" has another example.

In your case, it might look like this:

df[col] = np.where(df[col]==row_condition, assigned_val, df[col])

I was thinking it might be a little quicker because it goes straight to numpy instead of going through pandas' indexing machinery to reach the underlying numpy arrays. This article compares pandas and numpy on large datasets: https://towardsdatascience.com/speed-testing-pandas-vs-numpy-ffbf80070ee7#:~:text=Numpy%20was%20faster%20than%20Pandas,exception%20of%20simple%20arithmetic%20operations.
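If you want to check that claim on data of your own size, a rough timing sketch (the 2-million-row frame and the condition on column 'B' are assumptions; the copy() cost is included in both variants, so treat the numbers as indicative only):

import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame(np.random.randint(0, 100, size=(2_000_000, 4)), columns=list('ABCD'))

def with_loc():
    df = base.copy()
    df.loc[df['B'] < 50, 'B'] = 100000

def with_np_where():
    df = base.copy()
    df['B'] = np.where(df['B'] < 50, 100000, df['B'])

print('loc:     ', timeit.timeit(with_loc, number=10))
print('np.where:', timeit.timeit(with_np_where, number=10))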

Nesha25