Speed up Pandas: find all columns which fullfill set of conditions

Question

I have data represented using pandas DataFrame, which for example looks as follows:

| id | entity | name | value | location

where id is an integer value, entity is an integer , name is a string, value is an integer, and location is a string (for example US, CA, UK etc).

Now, I want to add a new column to this data frame, column "flag", where values are assigned as follows:

for d in df.iterrows():

    if d.entity == 10 and d.value != 1000 and d.location == CA:
        d.flag = "A" 
    elif d.entity != 10 and d.entity != 0 and d.value == 1000 and d.location == US:
        d.flag = "C"
    elif d.entity == 0 and d.value == 1000 and d.location == US"
        d.flag = "B"
    else:
        print("Different case")

Is there a way to speed this up and use some built in functions instead of the for loop?

Erfan · Accepted Answer · 2019-04-10T13:31:05.130

4

Use np.select which you pass a list of conditions, based on those conditions you give it choices and you can specify a default value when none of the conditions is met.

conditions = [
    (d.entity == 10) & (d.value != 1000) & (d.location == 'CA'),
    (d.entity != 10) & (d.entity != 0) & (d.value == 1000) & (d.location == 'US'),
    (d.entity == 0) & (d.value == 1000) & (d.location == 'US')
]

choices = ["A", "C", "B"]

df['flag'] = np.select(conditions, choices, default="Different case")

edited Apr 10 '19 at 13:31

answered Apr 10 '19 at 13:30

Erfan

40,971
8
66
78

Thanks, this looks interesting! One question - should I do it per row (since d is a single row), so still in a loop? – Ziva Apr 10 '19 at 13:46
1

No, this is a vectorized solution. Loops in python general and pandas are slow. So use the last line to create your new column called `flag`. @Ziva – Erfan Apr 10 '19 at 13:47

score 4 · Answer 2 · answered Apr 10 '19 at 13:31

Add () with bitwise and -> & for working with numpy.select:

m = [
    (d.entity == 10) & (d.value != 1000) & (d.location == 'CA'),
    (d.entity != 10) & (d.entity != 0) & (d.value == 1000) & (d.location == 'US'),
    (d.entity == 0) & (d.value == 1000) & (d.location == 'US')
]

df['flag'] = np.select(m, ["A", "C", "B"], default="Different case")

Avish · Answer 3 · 2019-04-10T13:50:30.207

You wrote "find all columns which fulfill a set of conditions", but your code shows you're actually trying to add a new column whose value for each row is computed from the values of other columns of the same row.

If that's indeed the case, you can use df.apply, giving it a function that computes the value for a specific row:

def flag_value(row):
    if row.entity == 10 and row.value != 1000 and row.location == CA:
        return "A"
    elif row.entity != 10 and row.entity != 0 and row.value == 1000 and row.location == US:
        return "C"
    elif row.entity == 0 and row.value == 1000 and row.location == US:
        return "B"
    else:
        return "Different case"

df['flag'] = df.apply(flag_value, axis=1)

Take a look at this related question for more information.

If you truly want to find all columns which specify some condition, the usual way to do this with a Pandas dataframe is to use df.loc and indexing:

only_a_cases = df.loc[df.entity == 10 & df.value != 1000 & df.location == "CA"]
# or:
only_a_cases = df.loc[lambda df: df.entity == 10 & df.value != 1000 & df.location == "CA"]

Speed up Pandas: find all columns which fullfill set of conditions

3 Answers3