Using Python 3, how can I speed up the update of a dataframe using current methods?

Question

I have a dataframe with ~5MM records. I have to update a column with a new value based on either an exact match to another column or a partial match. I've written the code to do this, but I could never figure out how to chain filters. Right now, my code updates the column for every single filter requirement. Right now my code looks like this.

In SQL it looks like:

UPDATE table
SET column1 = 'Value' 
WHERE column2 IN ('123','456','789')

OR

UPDATE table
SET column1 = 'a different value'
WHERE 1 = 1
AND column2 LIKE '987%'
OR column2 LIKE '654%'

Currently, in Python, I have:

df.loc[df['column2'] == '123', 'column1 '] = 'Value'
df.loc[df['column2'] == '456', 'column1 '] = 'Value'
df.loc[df['column2'] == '789', 'column1 '] = 'Value'

or

df.loc[df.column2.str.startswith('987'),'COMMODITY']='a different value'
df.loc[df.column2.str.startswith('654'),'COMMODITY']='a different value'

I'm guessing that Python has to loop through all 5MM records for each line which is super inefficient. What is the better way to do this?

It's been suggested that this is a duplicate of How to implement 'in' and 'not in' for Pandas dataframe. This question is five years old, and isin() doesn't work like that anymore.

score 2 · Accepted Answer · answered Mar 24 '19 at 23:08

2

How about using np.where with isin

df[ 'column1 '] = np.where(df.column2.isin(['123','456','789']),'Value',df.column1)

About the startwiths

df.COMMODITY=np.where(df.column2.str.startswith(('987','654')),'a different value',df.COMMODITY)

answered Mar 24 '19 at 23:08

BENY

317,841
20
164
234

That is a really wacky syntax and no way would I have put that together on my own. Thanks. I'm surprised that the isin() works now. Before I was getting errors about how it only takes two parameters. – Bob Wakefield Mar 24 '19 at 23:37

Using Python 3, how can I speed up the update of a dataframe using current methods?

1 Answers1