0

Most efficient for a big dataset in pandas:

I would like to add a new column Z taking the value from X if there is a value, if not, I want to take the value from Y.

Another thing, it there a possibility to use ternary operations to add a new column Z based on, if column Y exist then column Y - column X, if not, then only X.

I'm looking for the most efficient way in both cases.

Thank you

Tonino Fernandez
  • 441
  • 4
  • 12

1 Answers1

1

Use numpy.where:

np.random.seed(123)

N = 10000
df = pd.DataFrame({'X':np.random.choice([np.nan, 1], size=N),
                   'Y':np.random.choice([3,4,6], size=N)})

df['Z1'] = np.where(df['X'].isna(), df['Y'],df['X'])

if 'Y' in df.columns:
    df['Z2'] = np.where(df['X'] - df['Y'], df['Y'],df['X'])
else:
    df['Z2'] = df['X']

print (df)
        X  Y   Z1   Z2
0     NaN  6  6.0  6.0
1     1.0  4  1.0  4.0
2     NaN  6  6.0  6.0
3     NaN  3  3.0  3.0
4     NaN  3  3.0  3.0
  ... ..  ...  ...
9995  1.0  6  1.0  6.0
9996  1.0  6  1.0  6.0
9997  NaN  6  6.0  6.0
9998  1.0  4  1.0  4.0
9999  1.0  6  1.0  6.0

[10000 rows x 4 columns]
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252