23

Suppose I have two dataframes:

>> df1

   0  1  2
0  a  b  c
1  d  e  f

>> df2

   0  1  2
0  A  B  C
1  D  E  F

How can I interleave the rows? i.e. get this:

>> interleaved_df

   0  1  2
0  a  b  c
1  A  B  C
2  d  e  f
3  D  E  F

(Note my real DFs have identical columns, but not the same number of rows).


What I've tried

inspired by this question (very similar, but asks on columns):

import pandas as pd
from itertools import chain, zip_longest

df1 = pd.DataFrame([['a','b','c'], ['d','e','f']])  
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']])

concat_df = pd.concat([df1,df2])

new_index = chain.from_iterable(zip_longest(df1.index, df2.index))
# new_index now holds the interleaved row indices

interleaved_df = concat_df.reindex(new_index)

ValueError: cannot reindex from a duplicate axis

The last call fails because df1 and df2 have some identical index values (which is also the case with my real DFs).

Any ideas?

OmerB
  • 4,134
  • 3
  • 20
  • 33

5 Answers5

24

You can sort the index after concatenating and then reset the index i.e

import pandas as pd

df1 = pd.DataFrame([['a','b','c'], ['d','e','f']])  
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']])

concat_df = pd.concat([df1,df2]).sort_index().reset_index(drop=True)

Output :

   0  1  2
0  a  b  c
1  A  B  C
2  d  e  f
3  D  E  F

EDIT (OmerB) : Incase of keeping the order regardless of the index value then.

import pandas as pd
df1 = pd.DataFrame([['a','b','c'], ['d','e','f']]).reset_index()  
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']]).reset_index()

concat_df = pd.concat([df1,df2]).sort_index().set_index('index')
Bharath M Shetty
  • 30,075
  • 6
  • 57
  • 108
  • Thanks! When trying this out, it didn't work because my original DFs don't have monotonic indexes (i.e. the 2nd row can have a smaller index than the 1st row. Sorry, I realize it wasn't obvious from the toy example I've posted...). But anyway - based on your answer I just needed to add reset_index to the original DFs and set_index afterwards, and it worked). – OmerB Aug 08 '17 at 11:08
  • Newbie question - I've submitted an edit to your answer with my fix. Should I wait till it's accepted before accepting the answer? – OmerB Aug 08 '17 at 11:45
  • 12
    Note that you should use `.sort_index(kind="mergesort")` to ensure preservation of order. Merge sort is stable, quick sort (the default) is not. – reve_etrange Oct 26 '17 at 17:51
  • @reve_etrange Interesting. Can you please explain why that is the case? – Somebody Out There Aug 04 '22 at 04:53
6

Use toolz.interleave

In [1024]: from toolz import interleave

In [1025]: pd.DataFrame(interleave([df1.values, df2.values]))
Out[1025]:
   0  1  2
0  a  b  c
1  A  B  C
2  d  e  f
3  D  E  F
Zero
  • 74,117
  • 18
  • 147
  • 154
  • This one is very elegant, but it's not applicable in my case because we lose the meta-data from the original DataFrame (index, data types etc.) – OmerB Aug 08 '17 at 11:42
1

Here's an extension of @Bharath's answer that can be applied to DataFrames with user-defined indexes without losing them, using pd.MultiIndex.

Define Dataframes with the full set of column/ index labels and names:

df1 = pd.DataFrame([['a','b','c'], ['d','e','f']], index=['one', 'two'], columns=['col_a', 'col_b','col_c'])  
df1.columns.name = 'cols'
df1.index.name = 'rows'
df2 = pd.DataFrame([['A','B','C'], ['D','E','F']], index=['one', 'two'], columns=['col_a', 'col_b','col_c'])
df2.columns.name = 'cols'
df2.index.name = 'rows'

Add DataFrame ID to MultiIndex:

df1.index = pd.MultiIndex.from_product([[1], df1.index], names=["df_id", df1.index.name])
df2.index = pd.MultiIndex.from_product([[2], df2.index], names=["df_id", df2.index.name])

Then use @Bharath's concat() and sort_index():

data = pd.concat([df1, df2], axis=0, sort=True)
data.sort_index(axis=0, level=data.index.names[::-1], inplace=True)

Output:

cols       col_a col_b col_c
df_id rows                  
1     one      a     b     c
2     one      A     B     C
1     two      d     e     f
2     two      D     E     F

Henard
  • 11
  • 2
0

You can try this way :

In [31]: import pandas as pd
    ...: from itertools import chain, zip_longest
    ...: 
    ...: df1 = pd.DataFrame([['a','b','c'], ['d','e','f']])  
    ...: df2 = pd.DataFrame([['A','B','C'], ['D','E','F']])

In [32]: concat_df = pd.concat([df1,df2]).sort_index()
    ...: 

In [33]: interleaved_df = concat_df.reset_index(drop=1)

In [34]: interleaved_df
Out[34]: 
   0  1  2
0  a  b  c
1  A  B  C
2  d  e  f
3  D  E  F
cs95
  • 379,657
  • 97
  • 704
  • 746
Md. Rezwanul Haque
  • 2,882
  • 7
  • 28
  • 45
0

You could also preallocate a new DataFrame, and then fill it using a slice.

def interleave(dfs):
    data = np.transpose(np.array([np.empty(dfs[0].shape[0]*len(dfs), dtype=dt) for dt in dfs[0].dtypes]))
    out = pd.DataFrame(data, columns=dfs[0].columns)
    for ix, df in enumerate(dfs):
        out.iloc[ix::len(dfs),:] = df.values
    return out

The preallocation code is taken from this question.

While there's a chance it could outperform the index method for certain data types / sizes, it won't behave gracefully if the DataFrames have different sizes.

Note - for ~200000 rows with 20 columns of mixed string, integer and floating types, the index method is around 5x faster.

reve_etrange
  • 2,561
  • 1
  • 22
  • 36