0

There are 3 columns in my existing dataframe df: A, B, C

I want to add another col D based on A, B, C

The logic is:

if (A == "a"):
    D = "a"
elif (A == "b") and (B in ["B", "C"]):
    D = "A"
elif (C == "c"):
    D = "c"
Note: the value of D can be NaN if all conditions are not satisfied.

Is there any elegant and compact ways of adding this column?

Meng
  • 1,148
  • 5
  • 15
  • 23
  • 1
    Possible duplicate of [pandas create new column based on values from other columns](http://stackoverflow.com/questions/26886653/pandas-create-new-column-based-on-values-from-other-columns) – lucasnadalutti Dec 08 '16 at 19:17

3 Answers3

4

A nested where should be the fastest

np.where(df.A == 'a', 'a', 
         np.where((df.A == 'b') & (df.B.isin(['B','C'])), 'A',
         np.where(df.C == 'c', 'c', np.nan)))

Speed Test

# create 100,000 rows of random data
df = pd.DataFrame({'A':np.random.choice(['a','b','c','A','B','C'], 100000, True),
               'B':np.random.choice(['a','b','c','A','B','C'], 100000, True), 
               'C':np.random.choice(['a','b','c','A','B','C'], 100000, True)})

%%timeit 
np.where(df.A == 'a', 'a', 
     np.where((df.A == 'b') & (df.B.isin(['B','C'])), 'A',
     np.where(df.C == 'c', 'c', np.nan)))

10 loops, best of 3: 33.4 ms per loop

def my_logic(x):
    if x[0] == 'a':
        return 'a'
    elif x[0] == 'b' and x[1] in ('B', 'C'):
        return 'A'
    elif x[2] == 'c':
        return 'c'
    return ''

%%timeit
df[['A', 'B', 'C']].apply(my_logic, axis=1)

1 loops, best of 3: 5.87 s per loop

Nested where is 175 times faster than apply - the method of last resort.

Ted Petrou
  • 59,042
  • 19
  • 131
  • 136
0

I think this is the most readable method, while still being somewhat compact.

def my_logic(x):
    if x[0] == 'a':
        return 'a'
    elif x[0] == 'b' and x[1] in ('B', 'C'):
        return 'A'
    elif x[2] == 'c':
        return 'c'
    return ''

df['D'] = df[['A', 'B', 'C']].apply(my_logic, axis=1)
Alex
  • 12,078
  • 6
  • 64
  • 74
  • `apply` should be a method of last resort as it can be incredibly slow for even just moderately sized dataframes as its not vectorized. Perhaps part of the problem is the vast array of apply functions in R. Pandas is probably best learned having no knowledge of apply. – Ted Petrou Dec 08 '16 at 20:17
  • @TedPetrou Nice work with the speed test in your comment! I believe that apply is an excellent pandas tool for beginners because it allows for code to be written in a more readable format. – Alex Dec 08 '16 at 22:31
0

This would be faster than the if/elif solution and less lines. However it is arguably not as readable.

df.loc[df.A=="a", "D") = "a"
df.loc[(df.A=="b") & df.B.isin("B", "C"), "D") = "A"
df.loc[(df.C=="c") & ~df.A.isin("a", "A"), "D"] = "c"
df.loc[~df.D.isin("a", "A", "c"), "D"] = np.nan
simon
  • 2,561
  • 16
  • 26