
Imagine I have a pandas DataFrame:

Column1 Column2
A       D
B       E
C       F

How can I get a resulting DataFrame in this form?

Column

 A
 D
 B
 E
 C
 F
Mayank Porwal
MichiganMagician

2 Answers


EDIT: see the benchmark below for a slightly faster solution.

You can do this:

# Import pandas library 
import pandas as pd

# The data
data = [["A", "D"], ["B", "E"], ["C", "F"]]

# Create DataFrame
df = pd.DataFrame(data, columns = ["Column1", "Column2"]) 

# Flatten and convert to DataFrame
new_df = pd.DataFrame(df.to_numpy().flatten())

print(new_df)

Output:

   0
0  A
1  D
2  B
3  E
4  C
5  F

new_df will be a pandas.DataFrame.

Note that df.to_numpy() converts the DataFrame to a NumPy array first, which is what makes flatten() available.

As suggested by @Michael Szczesny, you can also do:

new_series = df.stack().reset_index(drop=True)

This will return a pandas.Series.
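
If a one-column DataFrame is wanted rather than a Series, the stacked result can be converted with to_frame. This is a minimal sketch reusing the question's data; the column name "Column" is taken from the desired output in the question, and the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([["A", "D"], ["B", "E"], ["C", "F"]],
                  columns=["Column1", "Column2"])

# stack() walks the frame row by row, so the values come out as A, D, B, E, C, F
new_series = df.stack().reset_index(drop=True)

# Series -> single-column DataFrame, named as in the question's desired output
new_df = new_series.to_frame(name="Column")
print(new_df)
```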

Added benchmark:

Based on @Mayank Porwal's answer, I added these benchmark results. I used timeit.repeat with repeat=7 and number=10000. Sorted from fastest to slowest:

new_df = pd.DataFrame(df.to_numpy().ravel('A')) # 51.0 µs
new_df = pd.DataFrame(df.to_numpy().ravel('K')) # 51.0 µs
new_df = pd.DataFrame(df.to_numpy().ravel('F')) # 51.1 µs
new_df = pd.DataFrame(df.to_numpy().flatten())  # 52.6 µs
new_df = pd.DataFrame(df.to_numpy().ravel('C')) # 53.4 µs
new_series = df.stack().reset_index(drop=True)  # 322.0 µs
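
For anyone who wants to reproduce a comparison like this, here is a rough sketch built on timeit.repeat. The helper name best_time_us, the smaller number of calls (chosen so the script finishes quickly), and the choice to report the best run are all assumptions of this sketch, not details of the original benchmark; absolute numbers will differ by machine and library version.

```python
import timeit

import pandas as pd

df = pd.DataFrame([["A", "D"], ["B", "E"], ["C", "F"]],
                  columns=["Column1", "Column2"])

# The statements being compared, keyed by a short label
candidates = {
    "ravel('A')": lambda: pd.DataFrame(df.to_numpy().ravel("A")),
    "ravel('K')": lambda: pd.DataFrame(df.to_numpy().ravel("K")),
    "ravel('F')": lambda: pd.DataFrame(df.to_numpy().ravel("F")),
    "flatten()": lambda: pd.DataFrame(df.to_numpy().flatten()),
    "ravel('C')": lambda: pd.DataFrame(df.to_numpy().ravel("C")),
    "stack()": lambda: df.stack().reset_index(drop=True),
}


def best_time_us(func, repeat=7, number=200):
    """Best per-call time in microseconds over `repeat` runs of `number` calls."""
    runs = timeit.repeat(func, repeat=repeat, number=number)
    return min(runs) / number * 1e6


timings = {label: best_time_us(func) for label, func in candidates.items()}

# Print from fastest to slowest
for label, t in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{label:12s} {t:8.1f} µs")
```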

Using numpy.ravel is fastest mainly because it returns a view whenever possible, whereas numpy.ndarray.flatten() always returns a copy. For details about numpy.ravel see: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.ravel.html
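
The view-versus-copy difference can be checked with np.shares_memory. This is a small sketch on a plain NumPy array (not a DataFrame), so that the memory layout is known; the variable names are illustrative.

```python
import numpy as np

arr = np.arange(6).reshape(3, 2)  # C-contiguous by construction

raveled = arr.ravel()      # default order "C" matches the layout: no copy needed
flattened = arr.flatten()  # always a fresh copy

print(np.shares_memory(arr, raveled))    # True
print(np.shares_memory(arr, flattened))  # False

# When the requested order does not match the memory layout, ravel has to copy too
print(np.shares_memory(arr, arr.ravel("F")))  # False
```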

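In short, "A" reads the elements in Fortran-like index order if the array is Fortran contiguous in memory (and in C-like order otherwise), while "K" reads the elements in the order they occur in memory. Note that only "C" order (the default, and what flatten() uses) is guaranteed to give the row-by-row result A, D, B, E, C, F asked for in the question; "F" reads column by column.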
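
To make the effect of the order argument concrete, the sketch below ravels the same values stored in two different memory layouts. It uses plain NumPy arrays with a known layout; the names c_arr, f_arr and results are illustrative.

```python
import numpy as np

c_arr = np.array([["A", "D"], ["B", "E"], ["C", "F"]])  # C-contiguous
f_arr = np.asfortranarray(c_arr)                        # same values, Fortran layout

results = {}
for name, arr in [("C layout", c_arr), ("F layout", f_arr)]:
    for order in "CFAK":
        results[(name, order)] = "".join(arr.ravel(order))
        print(name, order, results[(name, order)])

# "C" and "F" ignore the layout: always ADBECF and ABCDEF respectively.
# "A" and "K" follow the layout: ADBECF for c_arr, ABCDEF for f_arr.
```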

Rivers

Use df.to_numpy with numpy.ravel:

In [2349]: x = pd.DataFrame(df.to_numpy().ravel('F'))

In [2350]: x
Out[2350]: 
     0
0    A
1    B
2    C
3    D
4    E
5    F

Note: This is fast, as the timings below show.
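
One thing to keep in mind is that ravel('F') reads the frame column by column, so the values come out as A, B, C, D, E, F, while the question asks for the row-by-row order A, D, B, E, C, F. A short sketch of both variants follows; the column name "Column" is borrowed from the question's desired output, and col_major and row_major are illustrative names.

```python
import pandas as pd

df = pd.DataFrame([["A", "D"], ["B", "E"], ["C", "F"]],
                  columns=["Column1", "Column2"])

# Column-by-column: all of Column1, then all of Column2
col_major = pd.DataFrame(df.to_numpy().ravel("F"), columns=["Column"])

# Row-by-row: matches the order requested in the question
row_major = pd.DataFrame(df.to_numpy().ravel("C"), columns=["Column"])

print(col_major["Column"].tolist())  # ['A', 'B', 'C', 'D', 'E', 'F']
print(row_major["Column"].tolist())  # ['A', 'D', 'B', 'E', 'C', 'F']
```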

Timing comparisons:

In [2369]: dd = pd.concat([df] * 1000)

# Rivers' answers:

In [2369]: %timeit pd.DataFrame(dd.to_numpy().flatten())
95.6 µs ± 1.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [2371]: %timeit dd.stack().reset_index(drop=True)
919 µs ± 9.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# My answer:

In [2372]: %timeit pd.DataFrame(dd.to_numpy().ravel('F'))
62 µs ± 577 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Mayank Porwal