Sort pandas dataframe based on two columns that are similar but one will be NaN if the other has a value

Question

I have a merged df which has 2 experiment IDs - experiment_a and experiment_b

They generally follow the naming scheme EXPT_YEAR_NUM, but some have add-ons, or have some other value in place of the year. In this df, wherever experiment_a has a value, experiment_b is NaN, and vice versa.

ie:

experiment_a    experiment_b
EXPT_2011_06     NaN
NaN              EXPT_2011_07

How do I sort so that the values of experiment_a and experiment_b are ordered together in ascending order, rather than sorting on experiment_a first (with experiment_b all NaN) and then on experiment_b for the rows where experiment_a is NaN?

This is what happens when I use sort_values:

df = df.sort_values(['experiment_a', 'experiment_b'])

It clearly just sorts _a first, then _b.

Use `where` to construct a single column? – Andras Deak -- Слава Україні, Feb 09 '18 at 11:35
Can you add more values to sample with expected output? – jezrael, Feb 09 '18 at 11:40

jezrael · Accepted Answer · 2018-02-09T12:10:18.113

I believe you need Series.fillna to combine the two columns into one, then argsort to get the positions of the sorted values, and finally iloc to select the rows in that order. The output is the DataFrame sorted across both columns:

print (df)
   experiment_a  experiment_b
0  EXPT_2011_06           NaN
1  EXPT_2010_06           NaN
2           NaN  EXPT_2011_07

df = df.iloc[df['experiment_a'].fillna(df['experiment_b']).argsort()]
print (df)
   experiment_a  experiment_b
1  EXPT_2010_06           NaN
0  EXPT_2011_06           NaN
2           NaN  EXPT_2011_07

Detail:

print (df['experiment_a'].fillna(df['experiment_b']))
0    EXPT_2011_06
1    EXPT_2010_06
2    EXPT_2011_07
Name: experiment_a, dtype: object

print (df['experiment_a'].fillna(df['experiment_b']).argsort())
0    1
1    0
2    2
Name: experiment_a, dtype: int64

I tested more solutions. np.where performs slightly better, but the difference mainly depends on the data:

print (df)
   experiment_a  experiment_b
0  EXPT_2011_03           NaN
1           NaN  EXPT_2009_08
2           NaN  EXPT_2010_06
3  EXPT_2010_07           NaN
4           NaN  EXPT_2011_07

#[500000 rows x 2 columns]
df = pd.concat([df] * 100000, ignore_index=True)

In [41]: %timeit (df.iloc[(np.where(df['experiment_a'].isnull(), df['experiment_b'], df['experiment_a'])).argsort()])
1 loop, best of 3: 318 ms per loop

In [42]: %timeit (df.iloc[df['experiment_a'].fillna(df['experiment_b']).argsort()])
1 loop, best of 3: 335 ms per loop

In [43]: %timeit (df.iloc[df['experiment_a'].combine_first(df['experiment_b']).argsort()])
1 loop, best of 3: 333 ms per loop

In [44]: %timeit (df.iloc[df.experiment_a.where(df.experiment_a.notnull(), df.experiment_b).argsort()])
1 loop, best of 3: 342 ms per loop
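
The timings above come from an IPython session. Outside IPython, a rough equivalent can be sketched with the standard `timeit` module, as below. The `candidates` dict and the smaller row count are my own choices to keep the run short, so absolute numbers will not match the ones above and will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({
    "experiment_a": ["EXPT_2011_03", np.nan, np.nan, "EXPT_2010_07", np.nan],
    "experiment_b": [np.nan, "EXPT_2009_08", "EXPT_2010_06", np.nan, "EXPT_2011_07"],
})
# 5000 rows instead of the 500000 used above, so the sketch runs quickly.
df = pd.concat([base] * 1000, ignore_index=True)

# Four ways of building the combined sort key, mirroring the runs above.
candidates = {
    "np.where": lambda: np.where(df["experiment_a"].isnull(),
                                 df["experiment_b"], df["experiment_a"]),
    "fillna": lambda: df["experiment_a"].fillna(df["experiment_b"]),
    "combine_first": lambda: df["experiment_a"].combine_first(df["experiment_b"]),
    "where": lambda: df["experiment_a"].where(df["experiment_a"].notnull(),
                                              df["experiment_b"]),
}

timings = {}
for name, build_key in candidates.items():
    # Time key construction + argsort + positional selection together.
    stmt = lambda: df.iloc[build_key().argsort()]
    timings[name] = min(timeit.repeat(stmt, number=3, repeat=3)) / 3
    print(f"{name}: {timings[name] * 1000:.1f} ms per loop")
```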

score 1 · Answer 2 · answered Feb 09 '18 at 11:43

First construct a single column:

key = df.experiment_a.where(df.experiment_a.notnull(), df.experiment_b)

Then get the positions that sort it:

idx = key.argsort()

Finally, select the rows in that order:

df.iloc[idx]
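
Put together, the three steps can be sketched as one runnable snippet. The sample data is made up for illustration, and two things here are my additions rather than part of the answer: `kind="mergesort"` (a stable sort, so rows with equal keys keep their original relative order) and the final `reset_index(drop=True)`.

```python
import numpy as np
import pandas as pd

# Illustrative data: exactly one of the two columns is filled per row.
df = pd.DataFrame({
    "experiment_a": ["EXPT_2011_03", np.nan, np.nan, "EXPT_2010_07", np.nan],
    "experiment_b": [np.nan, "EXPT_2009_08", "EXPT_2010_06", np.nan, "EXPT_2011_07"],
})

# Step 1: single key column, taking experiment_b where experiment_a is null.
key = df["experiment_a"].where(df["experiment_a"].notnull(), df["experiment_b"])

# Step 2: positions that sort the key (mergesort is stable; my addition).
idx = key.argsort(kind="mergesort")

# Step 3: select rows in that order, then renumber the index (optional).
result = df.iloc[idx].reset_index(drop=True)
print(result)
```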

John Zwinck
