Imagine I have a pandas DataFrame:
Column1 Column2
A D
B E
C F
How can I get a resulting DataFrame in this form?
Column
A
D
B
E
C
F
EDIT: see the benchmark below for a slightly faster solution.
You can do this:
# Import pandas library
import pandas as pd
# The data
data = [["A", "D"], ["B", "E"], ["C", "F"]]
# Create DataFrame
df = pd.DataFrame(data, columns = ["Column1", "Column2"])
# Flatten and convert to DataFrame
new_df = pd.DataFrame(df.to_numpy().flatten())
print(new_df)
Output (index and column header omitted):
A
D
B
E
C
F
new_df will be a pandas.DataFrame.
Note the use of df.to_numpy() too.
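The question asks for a single column named Column, while the snippet above leaves the column labelled 0. A minimal sketch of naming it at construction time, assuming the same df as above (the name "Column" is taken from the question, and passing a dict is just one way to do it):

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame([["A", "D"], ["B", "E"], ["C", "F"]],
                  columns=["Column1", "Column2"])

# flatten() walks the array row by row (C order), so the values interleave
# as A, D, B, E, C, F; passing a dict names the resulting column.
new_df = pd.DataFrame({"Column": df.to_numpy().flatten()})

print(new_df["Column"].tolist())  # ['A', 'D', 'B', 'E', 'C', 'F']
```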
And as suggested by @Michael Szczesny you can do:
new_series = df.stack().reset_index(drop=True)
Which will return a pandas.Series.
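If a DataFrame is needed rather than a Series, the stacked result can probably be converted with Series.to_frame. A small sketch, assuming the same df as above; the column name "Column" comes from the question:

```python
import pandas as pd

df = pd.DataFrame([["A", "D"], ["B", "E"], ["C", "F"]],
                  columns=["Column1", "Column2"])

# stack() moves the column labels into an inner index level, which yields
# the values row by row; reset_index(drop=True) replaces that index with 0..5.
new_series = df.stack().reset_index(drop=True)

# Convert the Series into a one-column DataFrame with the desired name.
new_df = new_series.to_frame(name="Column")

print(new_df["Column"].tolist())  # ['A', 'D', 'B', 'E', 'C', 'F']
```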
Added benchmark:
Based on @Mayank Porwal's answer, I added these benchmark results.
I used timeit.repeat with repeat = 7, number = 10000.
Sorted from fastest to slowest:
new_df = pd.DataFrame(df.to_numpy().ravel('A')) # 51.0 µs
new_df = pd.DataFrame(df.to_numpy().ravel('K')) # 51.0 µs
new_df = pd.DataFrame(df.to_numpy().ravel('F')) # 51.1 µs
new_df = pd.DataFrame(df.to_numpy().flatten()) # 52.6 µs
new_df = pd.DataFrame(df.to_numpy().ravel('C')) # 53.4 µs
new_series = df.stack().reset_index(drop=True) # 322.0 µs
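A benchmark like the one above could be reproduced roughly as follows. This is only a sketch: best_time_us is a hypothetical helper, the answer does not say how the seven runs were aggregated (this version takes the best run), and repeat and number are reduced here to keep the run short.

```python
import timeit

import pandas as pd

df = pd.DataFrame([["A", "D"], ["B", "E"], ["C", "F"]],
                  columns=["Column1", "Column2"])

# Candidate expressions taken from the list above.
candidates = {
    "ravel('F')": lambda: pd.DataFrame(df.to_numpy().ravel("F")),
    "flatten()": lambda: pd.DataFrame(df.to_numpy().flatten()),
    "ravel('C')": lambda: pd.DataFrame(df.to_numpy().ravel("C")),
    "stack()": lambda: df.stack().reset_index(drop=True),
}


def best_time_us(func, repeat=3, number=200):
    """Return the best per-call time in microseconds (hypothetical helper)."""
    # timeit.repeat returns one total time (in seconds) per repetition.
    totals = timeit.repeat(func, repeat=repeat, number=number)
    return min(totals) / number * 1e6


results = {name: best_time_us(func) for name, func in candidates.items()}

# Print from fastest to slowest; the numbers depend on machine and versions.
for name, micros in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name:<12} {micros:8.1f} µs")
```

Passing repeat=7, number=10000 to the helper would match the settings quoted above, at the cost of a much longer run.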
numpy.ravel is fastest mainly because it returns a view of the array whenever possible, whereas numpy.ndarray.flatten always returns a copy.
For details about numpy.ravel see: https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.ravel.html
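In short, "A" reads the elements in Fortran-like (column-major) index order if the array is Fortran-contiguous in memory and in C-like (row-major) order otherwise, while "K" reads the elements in the order they occur in memory. Keep in mind that "F" order reads column by column, so ravel('F') yields A, B, C, D, E, F instead of the row-wise A, D, B, E, C, F that flatten() and ravel('C') produce; "A" and "K" can do the same, depending on the memory layout of the array.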
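The effect of the order argument is easier to see on small arrays. The sketch below uses plain NumPy arrays instead of a DataFrame, because whether to_numpy() hands back a C- or Fortran-contiguous array depends on pandas internals; arr_c and arr_f are illustrative names, not part of the answer above.

```python
import numpy as np

# Row-major (C-contiguous) array with the same values as the sample DataFrame.
arr_c = np.array([["A", "D"], ["B", "E"], ["C", "F"]], dtype=object)
# Same values, stored column-major (Fortran-contiguous).
arr_f = np.asfortranarray(arr_c)

# 'C' always reads row by row, 'F' always reads column by column.
print(arr_c.ravel("C").tolist())  # ['A', 'D', 'B', 'E', 'C', 'F']
print(arr_c.ravel("F").tolist())  # ['A', 'B', 'C', 'D', 'E', 'F']

# 'A' and 'K' follow the memory layout, so the result changes with it.
print(arr_c.ravel("A").tolist())  # row by row for the C-contiguous array
print(arr_f.ravel("A").tolist())  # column by column for the Fortran one
print(arr_f.ravel("K").tolist())  # memory order, here column by column

# ravel can return a view when the requested order matches the layout;
# flatten always copies.
print(np.shares_memory(arr_c, arr_c.ravel("C")))  # True
print(np.shares_memory(arr_f, arr_f.ravel("F")))  # True
print(np.shares_memory(arr_c, arr_c.flatten()))   # False
```

If the interleaved order from the question is required, 'C' order (the default for both ravel and flatten) is the safe choice.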
Use df.to_numpy with numpy.ravel:
In [2349]: x = pd.DataFrame(df.to_numpy().ravel('F'))
In [2350]: x
Out[2350]:
   0
0  A
1  B
2  C
3  D
4  E
5  F
Note: This is quite performant, as the timing comparisons below show.
Timing comparisons:
In [2369]: dd = pd.concat([df] * 1000)
# Rivers' answers:
In [2369]: %timeit pd.DataFrame(dd.to_numpy().flatten())
95.6 µs ± 1.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [2371]: %timeit dd.stack().reset_index(drop=True)
919 µs ± 9.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# My answer:
In [2372]: %timeit pd.DataFrame(dd.to_numpy().ravel('F'))
62 µs ± 577 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)