
I've got a dataframe which, for MVP purposes, looks like:

Val_x, Val_y
NaN,   2
1,     NaN
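For reproducibility, a minimal sketch that builds this example frame (values taken from the table above):

import numpy as np
import pandas as pd

# two columns where exactly one of each pair is NaN
df = pd.DataFrame({'Val_x': [np.nan, 1], 'Val_y': [2, np.nan]})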

And I'm trying to generate a new column, Val, which combines these two by taking whichever value is not NaN.

Currently I'm doing it via:

df['Val'] = [df.iloc[j]['Val_x'] if pd.isnull(df.iloc[j]['Val_y']) else df.iloc[j]['Val_y'] for j in range(len(df))]

But this is slow, and it doesn't feel very pythonic or pandas-native.

I.e. I want to arrive at:

Val_x, Val_y, Val
NaN,   2,     2
1,     NaN,   1

Where I can then drop Val_x and Val_y, e.g. df.drop(['Val_x', 'Val_y'], axis=1).

If anyone has a pointer, I'd appreciate it.

Thanks

Edit:

I'm trying to generate a new column which is effectively the non-NaN value of two other columns. For technical reasons I had to split my dataframe into two objects to process differently, and I now wish to re-join them. My original dataframe was (index, Val); after re-joining I have (index, Val_x, Val_y), where some of Val_x is NaN and some of Val_y is NaN, but never both, and I now wish to simplify that back down into one column.
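For context, those _x/_y suffixes are what a pandas merge produces when both halves still carry a Val column. A minimal sketch with hypothetical split frames (the names left/right and the values are illustrative, not from the original code):

import numpy as np
import pandas as pd

# hypothetical halves of the original (index, Val) frame
left = pd.DataFrame({'Val': [np.nan, 1.0]})
right = pd.DataFrame({'Val': [2.0, np.nan]})

# re-joining on the index suffixes the clashing column names
rejoined = left.merge(right, left_index=True, right_index=True)
# rejoined.columns -> ['Val_x', 'Val_y']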

Henry

3 Answers


UPDATE 3: here is a brilliant generic vectorized solution from @AndyJones, which finds the first non-null value per row:

df['new'] = df.T.bfill().iloc[0]
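To see why this works: transposing turns each original row into a column, bfill() then pulls the first non-null value up to the top row, and .iloc[0] reads that row off. A minimal sketch (hypothetical three-column frame; note that transposing a mixed-dtype frame upcasts to object):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1.0, np.nan],
                   'b': [2.0, np.nan, np.nan],
                   'c': [5.0, 6.0, 7.0]})

# after df.T each original row is a column; bfill() fills every NaN
# from the next valid value below it, so row 0 of the transposed
# frame holds the first non-null value of each original row
df['new'] = df.T.bfill().iloc[0]
# df['new'] -> 2.0, 1.0, 7.0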

UPDATE 2: timing for a 400,000-row DataFrame

In [269]: df = pd.concat([df] * 10**5, ignore_index=True)

In [270]: df.shape
Out[270]: (400000, 2)

In [271]: %timeit df['Val_x'].fillna(df['Val_y'])
10 loops, best of 3: 127 ms per loop

In [272]: %timeit df['Val_y'].where(df['Val_y'].notnull(), df['Val_x'])
10 loops, best of 3: 98.9 ms per loop

In [273]: %timeit np.where(df['Val_x'].notnull(), df['Val_x'], df['Val_y'])
10 loops, best of 3: 31 ms per loop
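For completeness, the fastest variant written out as a full assignment; a runnable sketch assuming the two-column frame from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Val_x': [np.nan, 1.0], 'Val_y': [2.0, np.nan]})

# np.where works on the underlying arrays and skips pandas'
# alignment machinery, which is where the speedup comes from
df['Val'] = np.where(df['Val_x'].notnull(), df['Val_x'], df['Val_y'])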

UPDATE:

df['Val'] = df['Val_x'].fillna(df['Val_y'])
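One nuance worth noting, sketched with an extra row where both columns hold a value: fillna keeps Val_x wherever it is present and only falls back to Val_y, which matters only if the question's "never both" assumption is violated:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Val_x': [np.nan, 1.0, 2.0],
                   'Val_y': [2.0, np.nan, 3.0]})

# Val_x wins wherever it is non-NaN, so the last row yields 2.0
df['Val'] = df['Val_x'].fillna(df['Val_y'])
# df['Val'] -> 2.0, 1.0, 2.0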

OLD answer:

Here is a generic (it should work for any number of columns) but slow (.apply(..., axis=1)) solution:

In [256]: df
Out[256]:
   Val_x  Val_y
0    NaN    2.0
1    1.0    NaN
2    2.0    3.0
3    NaN    NaN

In [257]: df['Val'] = df.apply(lambda x: x[x.first_valid_index()]
                                         if x.first_valid_index() is not None
                                         else np.nan,
                               axis=1)

In [258]: df
Out[258]:
   Val_x  Val_y  Val
0    NaN    2.0  2.0
1    1.0    NaN  1.0
2    2.0    3.0  2.0
3    NaN    NaN  NaN
MaxU - stand with Ukraine

A colleague solved this for me with:

df['Val'] = df['Val_y'].where(df['Val_y'].notnull(), df['Val_x'])

For benchmarking reference: on my data frame, the original code took about 22 seconds to run, while the version in this answer took under 1 second. I've not tried to benchmark MaxU's suggestion.
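An equivalent spelling, for what it's worth, is Series.combine_first, which fills the caller's NaNs from the argument; a minimal sketch assuming the question's frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Val_x': [np.nan, 1.0], 'Val_y': [2.0, np.nan]})

# NaNs in Val_y are filled from Val_x, matching the .where() above
df['Val'] = df['Val_y'].combine_first(df['Val_x'])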

Henry

You can fill the NaNs with 0 and sum the columns:

import numpy as np
import pandas as pd

data = [[1, np.nan], [np.nan, 4]]
df = pd.DataFrame(data, columns=['col1', 'col2'])
df:
   col1  col2
0   1.0   NaN
1   NaN   4.0

df = df.fillna(0)
df['col3'] = df['col1'] + df['col2']
df:
   col1  col2  col3
0   1.0   0.0   1.0
1   0.0   4.0   4.0
Ika8
  • Oh that's a clever one, I like it. Not generically applicable in the case of non-numeric data, but overall quite elegant. – Henry Mar 10 '17 at 11:31
  • This would work only in the case when there is only one "not-null" value per row. In that case we could use: `df['col3'] = df.sum(axis=1)` – MaxU - stand with Ukraine Mar 10 '17 at 11:44
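Sketching the variant from MaxU's comment: sum() skips NaN by default, so with at most one non-null value per row the row sum is that value; min_count=1 (available in newer pandas versions) keeps all-NaN rows as NaN instead of 0:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan], [np.nan, 4]], columns=['col1', 'col2'])

# row-wise sum; with one non-null per row this acts as a coalesce
df['col3'] = df[['col1', 'col2']].sum(axis=1, min_count=1)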