
I've got a dataframe which, for MVP purposes, looks like:

Val_x, Val_y
NaN,   2
1,     NaN
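For reproducibility, a minimal sketch that builds this example frame (values taken from the table above):

import numpy as np
import pandas as pd

# two columns where exactly one of each pair is NaN
df = pd.DataFrame({'Val_x': [np.nan, 1], 'Val_y': [2, np.nan]})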

And I'm trying to generate a new column, Val, which combines these two by taking whichever value is not NaN.

Currently I'm doing it via:

df['Val'] = [df.iloc[j]['Val_x'] if pd.isnull(df.iloc[j]['Val_y']) else df.iloc[j]['Val_y'] for j in range(len(df))]

But this is slow, and it doesn't feel very pythonic or pandas-native.

I.e. I want to arrive at:

Val_x, Val_y, Val
NaN,   2,     2
1,     NaN,   1

Where I can then drop Val_x and Val_y, e.g. df.drop(['Val_x', 'Val_y'], axis=1).

If anyone has a pointer, I'd appreciate it.

Thanks

Edit:

I'm trying to generate a new column which is effectively the non-NaN value of two other columns. For technical reasons I had to split my dataframe into two objects to process differently, and I now wish to re-join them. My original dataframe was (index, Val); after re-joining I have (index, Val_x, Val_y), where some of Val_x is NaN and some of Val_y is NaN, but never both, and I now wish to simplify that back down into one column.
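For context, those _x/_y suffixes are what a pandas merge produces when both halves still carry a Val column. A minimal sketch with hypothetical split frames (the names left/right and the values are illustrative, not from the original code):

import numpy as np
import pandas as pd

# hypothetical halves of the original (index, Val) frame
left = pd.DataFrame({'Val': [np.nan, 1.0]})
right = pd.DataFrame({'Val': [2.0, np.nan]})

# re-joining on the index suffixes the clashing column names
rejoined = left.merge(right, left_index=True, right_index=True)
# rejoined.columns -> ['Val_x', 'Val_y']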

Henry

3 Answers


UPDATE 3: here is a brilliant generic vectorized solution from @AndyJones, which finds the first non-null value per row:

df['new'] = df.T.bfill().iloc[0]
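To see why this works: transposing turns each original row into a column, bfill() then pulls the first non-null value up to the top row, and .iloc[0] reads that row off. A minimal sketch (hypothetical three-column frame; note that transposing a mixed-dtype frame upcasts to object):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1.0, np.nan],
                   'b': [2.0, np.nan, np.nan],
                   'c': [5.0, 6.0, 7.0]})

# after df.T each original row is a column; bfill() fills every NaN
# from the next valid value below it, so row 0 of the transposed
# frame holds the first non-null value of each original row
df['new'] = df.T.bfill().iloc[0]
# df['new'] -> 2.0, 1.0, 7.0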

UPDATE 2: timing for a 400,000-row DataFrame

In [269]: df = pd.concat([df] * 10**5, ignore_index=True)

In [270]: df.shape
Out[270]: (400000, 2)

In [271]: %timeit df['Val_x'].fillna(df['Val_y'])
10 loops, best of 3: 127 ms per loop

In [272]: %timeit df['Val_y'].where(df['Val_y'].notnull(), df['Val_x'])
10 loops, best of 3: 98.9 ms per loop

In [273]: %timeit np.where(df['Val_x'].notnull(), df['Val_x'], df['Val_y'])
10 loops, best of 3: 31 ms per loop
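For completeness, the fastest variant written out as a full assignment; a runnable sketch assuming the two-column frame from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Val_x': [np.nan, 1.0], 'Val_y': [2.0, np.nan]})

# np.where works on the underlying arrays and skips pandas'
# alignment machinery, which is where the speedup comes from
df['Val'] = np.where(df['Val_x'].notnull(), df['Val_x'], df['Val_y'])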

UPDATE:

df['Val'] = df['Val_x'].fillna(df['Val_y'])
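One nuance worth noting, sketched with an extra row where both columns hold a value: fillna keeps Val_x wherever it is present and only falls back to Val_y, which matters only if the question's "never both" assumption is violated:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Val_x': [np.nan, 1.0, 2.0],
                   'Val_y': [2.0, np.nan, 3.0]})

# Val_x wins wherever it is non-NaN, so the last row yields 2.0
df['Val'] = df['Val_x'].fillna(df['Val_y'])
# df['Val'] -> 2.0, 1.0, 2.0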

OLD answer:

Here is a generic (it should work for any number of columns) but slow (.apply(..., axis=1)) solution:

In [256]: df
Out[256]:
   Val_x  Val_y
0    NaN    2.0
1    1.0    NaN
2    2.0    3.0
3    NaN    NaN

In [257]: df['Val'] = df.apply(lambda x: x[x.first_valid_index()]
                                         if x.first_valid_index() is not None
                                         else np.nan,
                               axis=1)

In [258]: df
Out[258]:
   Val_x  Val_y  Val
0    NaN    2.0  2.0
1    1.0    NaN  1.0
2    2.0    3.0  2.0
3    NaN    NaN  NaN
MaxU - stand with Ukraine

A colleague solved this for me with:

df['Val'] = df['Val_y'].where(df['Val_y'].notnull(), df['Val_x'])

For benchmarking reference: on my data frame, the original code took about 22 seconds to run, while the version in this answer took under 1 second. I've not tried to benchmark MaxU's suggestion.
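An equivalent spelling, for what it's worth, is Series.combine_first, which fills the caller's NaNs from the argument; a minimal sketch assuming the question's frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Val_x': [np.nan, 1.0], 'Val_y': [2.0, np.nan]})

# NaNs in Val_y are filled from Val_x, matching the .where() above
df['Val'] = df['Val_y'].combine_first(df['Val_x'])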

Henry

You can fill the NaNs with 0 and sum the columns:

import numpy as np
import pandas as pd

data = [[1, np.nan], [np.nan, 4]]
df = pd.DataFrame(data, columns=['col1', 'col2'])
df:
   col1  col2
0   1.0   NaN
1   NaN   4.0

df = df.fillna(0)
df['col3'] = df['col1'] + df['col2']
df:
   col1  col2  col3
0   1.0   0.0   1.0
1   0.0   4.0   4.0
Ika8
  • Oh that's a clever one, I like it. Not generically applicable in the case of non-numeric data, but overall quite elegant. – Henry Mar 10 '17 at 11:31
  • This would work only in the case when there is only one "not-null" value per row. In that case we could use: `df['col3'] = df.sum(axis=1)` – MaxU - stand with Ukraine Mar 10 '17 at 11:44
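Sketching the variant from MaxU's comment: sum() skips NaN by default, so with at most one non-null value per row the row sum is that value; min_count=1 (available in newer pandas versions) keeps all-NaN rows as NaN instead of 0:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan], [np.nan, 4]], columns=['col1', 'col2'])

# row-wise sum; with one non-null per row this acts as a coalesce
df['col3'] = df[['col1', 'col2']].sum(axis=1, min_count=1)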