Using np.where, or another broadcasting technique in pandas dataframe

Question

I have a dataframe in which several new columns need to be populated depending on conditions in existing columns. I wrote a function and have been applying it with df.apply, but this is exceptionally slow. I'm looking for a faster way to do the following:

def function(row):
    # With axis=1, apply passes one row at a time
    if pd.isnull(row['criteria_column']):
        return row['return_column']
    else:
        return None
df['new_column'] = df.apply(function, axis=1)

I'd like to do something like:

df['new_column'] = np.where(pd.isnull(df['criteria_column'] == True),
                            df['return_column'], "")

However, this results in ValueError: Could not construct Timestamp from argument <type 'bool'>

metaperture · Accepted Answer · 2014-05-29T12:38:41.680

Use indexing instead of apply; it's much faster:

df["new_column"] = ""
is_null = pd.isnull(df["criteria_column"])
df["new_column"][is_null] = df["return_column"][is_null] # method 1

For reference, here are a few more ways of doing the same thing as the last line:

df["new_column"][is_null] = df["return_column"][is_null] # method 1
df["new_column"].loc[is_null] = df["return_column"].loc[is_null] # method 2
df.loc[is_null, "new_column"] = df.loc[is_null, "return_column"] # method 3, thanks @joris

For those curious, methods 1 and 2 access the pandas.Series that is the column and do a selective assignment on it. Note especially that series[is_null] ends up calling series.loc[is_null] eventually anyway in this instance.

Lastly, method 3 is a convenience method for doing method 2 that removes possible ambiguities, reduces memory used, and permits assignment after successive chaining. If you're doing complex selection chaining and don't want intermediate copies, or want to assign to the selection, that method will likely be better. See the pandas documentation.
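
As a minimal, self-contained sketch of method 3 next to the np.where form the question was aiming for: the sample data below is made up for illustration, and only the column names come from the question. The error in the question appears to come from the misplaced parenthesis, which compares the datetime column to True before pd.isnull sees it; passing the null mask directly avoids that comparison.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "criteria_column": [pd.Timestamp("2014-05-28"), pd.NaT,
                        pd.Timestamp("2014-05-29"), pd.NaT],
    "return_column": ["a", "b", "c", "d"],
})

# Boolean mask: True where criteria_column is missing.
is_null = pd.isnull(df["criteria_column"])

# Method 3: start with empty strings, then fill only the masked rows.
df["new_column"] = ""
df.loc[is_null, "new_column"] = df.loc[is_null, "return_column"]

# np.where variant: the mask is passed as-is, with no "== True" comparison.
df["new_column_where"] = np.where(is_null, df["return_column"], "")

print(df)
```

Both columns should end up with the return_column value where criteria_column is null and an empty string elsewhere.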

edited May 29 '14 at 12:38

answered May 28 '14 at 22:56

metaperture


Indeed, but it's better to use `df.loc[is_null, "new_column"]` instead of `df["new_column"][is_null]` to avoid problems with chained assignment. – joris May 28 '14 at 22:58
iirc, *.loc is much faster, but I find the chained assignment to be much more readable, so I'd prefer it unless there was a hard reason not to. What issues with chained assignment have you come across? – metaperture May 28 '14 at 23:02
Explanation in the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy, another example on SO: http://stackoverflow.com/questions/21463589/pandas-chained-assignments. Probably in this case it won't be a problem (but e.g. if you switch the order of `["new_column"]` and `[is_null]`, it could be a problem). Because chained assignment sometimes causes problems, and it is not always easy to tell when it will, it is better to use the `loc` idiom to prevent this. – joris May 28 '14 at 23:09
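
A small sketch of the switched-order case mentioned in that comment, using made-up data (the column names follow the question): selecting rows with the mask first produces a temporary copy, so the assignment is expected to leave the original frame untouched, while the `loc` form updates it. Warnings are silenced here only to keep the output clean; pandas normally warns about this pattern.

```python
import warnings

import pandas as pd

# Hypothetical sample data; the column names follow the question.
df = pd.DataFrame({
    "criteria_column": [1.0, None, 3.0],
    "return_column": ["a", "b", "c"],
})
df["new_column"] = ""
is_null = pd.isnull(df["criteria_column"])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide the chained-assignment warning
    # Rows first, column second: df[is_null] is a temporary copy,
    # so this assignment never reaches df itself.
    df[is_null]["new_column"] = "b"

after_chained = df["new_column"].tolist()

# A single .loc call selects rows and column together and writes into df.
df.loc[is_null, "new_column"] = df.loc[is_null, "return_column"]
after_loc = df["new_column"].tolist()

print(after_chained, after_loc)
```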
Thanks. For those interested, when I previously used `.apply` for 3 functions, it took ~40 seconds. The `.loc` method takes ~0.2 seconds. 200x faster. – DataSwede May 29 '14 at 01:07
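
For anyone who wants to reproduce that kind of comparison, here is a rough timing sketch. The data, the frame size, and the helper names (`fill_row`, `with_apply`, `with_loc`) are assumptions for illustration, not the code from the question, and the measured ratio will vary with data size, pandas version, and hardware.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: roughly half of criteria_column is missing.
rng = np.random.default_rng(0)
n = 2_000
criteria = rng.random(n)
criteria[rng.random(n) < 0.5] = np.nan
df = pd.DataFrame({
    "criteria_column": criteria,
    "return_column": np.arange(n).astype(str),
})


def fill_row(row):
    # Row-wise version, called once per row by apply(axis=1).
    return row["return_column"] if pd.isnull(row["criteria_column"]) else ""


def with_apply(frame):
    return frame.apply(fill_row, axis=1)


def with_loc(frame):
    # Vectorized version: build the mask once and assign in one step.
    out = pd.Series("", index=frame.index, dtype=object)
    is_null = pd.isnull(frame["criteria_column"])
    out.loc[is_null] = frame.loc[is_null, "return_column"]
    return out


apply_seconds = timeit.timeit(lambda: with_apply(df), number=1)
loc_seconds = timeit.timeit(lambda: with_loc(df), number=1)
print(f"apply: {apply_seconds:.4f}s, loc: {loc_seconds:.4f}s")
```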
