
If I have a dataframe with only two datatypes like below:

import pandas as pd

d = {'col1': [1, 2], 'col2': ['jack', 'bill'], 'col3': [4, 5], 'col4': ['megan', 'sarah']}
df = pd.DataFrame(data=d)
print(df)


   col1  col2  col3   col4
0     1  jack     4  megan
1     2  bill     5  sarah


print(df.dtypes)

col1     int64
col2    object
col3     int64
col4    object

Is there a way to stack these columns based only on data type? The end result would be:

   col1  col2
0     1  jack
1     2  bill
2     4  megan
3     5  sarah

It's not necessary for the final column names to remain the same.
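A sketch of the desired reshape, selecting each dtype group with `select_dtypes` and stacking its columns end to end:

```python
import pandas as pd

d = {'col1': [1, 2], 'col2': ['jack', 'bill'],
     'col3': [4, 5], 'col4': ['megan', 'sarah']}
df = pd.DataFrame(data=d)

# For each distinct dtype, take its columns, transpose, and stack so the
# columns are concatenated one after another (col1 then col3, col2 then col4).
parts = [df.select_dtypes(include=[dt]).T.stack().reset_index(drop=True)
         for dt in df.dtypes.unique()]
out = pd.concat(parts, axis=1)
print(out)
```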

Chris Macaluso

3 Answers


This works with your sample data; I'm not sure whether it works with general data:

(df.groupby(df.dtypes, axis=1)
   .apply(lambda x: (x.stack().reset_index(drop=True)))
)

Output

   int64 object
0      1   jack
1      4  megan
2      2   bill
3      5  sarah
Quang Hoang
  • Just looking into this a little more. The two datatypes I'm using in my real dataset are actually datetime and float. I first needed to change `axis=1` to `axis='columns'` for my dataframe, but now I'm receiving `TypeError: unhashable type: 'list'`, which is interesting because there are no lists inside the dataframe – Chris Macaluso Aug 01 '19 at 15:18
  • I wonder what'd happen if there is a mismatch of # of columns for each dtype ;) – rafaelc Aug 01 '19 at 15:26
  • @rafaelc that's what I meant when I said it might not work in general. – Quang Hoang Aug 01 '19 at 15:29
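As the comment thread suggests, `groupby(..., axis=1)` is fragile (and deprecated in recent pandas versions). A sketch of an equivalent reshape that avoids it, stacking each dtype group selected with `select_dtypes`; it reproduces the row-major order of the output above:

```python
import pandas as pd

d = {'col1': [1, 2], 'col2': ['jack', 'bill'],
     'col3': [4, 5], 'col4': ['megan', 'sarah']}
df = pd.DataFrame(data=d)

# Stack each dtype group separately and line the results up side by side;
# str(dt) gives the 'int64' / 'object' column labels seen in the answer.
out = pd.concat(
    {str(dt): df.select_dtypes(include=[dt]).stack().reset_index(drop=True)
     for dt in df.dtypes.unique()},
    axis=1)
print(out)
```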

Why not give the for loop a chance:

pd.DataFrame([ df.loc[:,df.dtypes==x].values.ravel() for x in df.dtypes.unique()]).T
Out[46]: 
   0      1
0  1   jack
1  4  megan
2  2   bill
3  5  sarah
BENY
  • A good solution. However because of the loop, processing time for yours scales poorly with larger datasets compared to the other solution posted – G. Anderson Aug 01 '19 at 15:18
  • Also, the real data set I'm working with has a datetime column for one of the dtypes. This performed some strange formatting on that column and condensed the rows into a single line for every day, kind of strange behavior. I haven't been able to dig into why yet – Chris Macaluso Aug 01 '19 at 15:20
  • @G.Anderson ummm, you sure? https://stackoverflow.com/questions/54028199/are-for-loops-in-pandas-really-bad-when-should-i-care/54028200#54028200 – BENY Aug 01 '19 at 15:20
  • I must have been mistaken about the cause, but the result is that it scales linearly with the number of rows, while the other solution doesn't. On 220 rows: `36.9 ms`. On 2200 rows: `315 ms`. On 22000 rows: `3.02 s` – G. Anderson Aug 01 '19 at 15:41
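On the datetime issue raised above: the one-liner pushes every dtype through a single mixed DataFrame, which coerces the values to object. A sketch of a variant that builds one Series per dtype instead, so each column keeps its own dtype (output order matches the answer):

```python
import pandas as pd

d = {'col1': [1, 2], 'col2': ['jack', 'bill'],
     'col3': [4, 5], 'col4': ['megan', 'sarah']}
df = pd.DataFrame(data=d)

# One Series per dtype instead of one mixed 2-D block, so ints stay int64,
# datetimes would stay datetime64, and so on.
out = pd.concat(
    [pd.Series(df.loc[:, df.dtypes == x].values.ravel())
     for x in df.dtypes.unique()],
    axis=1)
print(out)
print(out.dtypes)
```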

For a mismatch in the number of columns per dtype, you may use the default constructor, which pads shorter rows with NaN. Borrowing Quang's idea of groupby(axis=1):

pd.DataFrame(df.groupby(df.dtypes, axis=1).apply(lambda s: list(s.values.ravel())).tolist()).T
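To see the padding in action, a sketch with an uneven split (`col5` is a made-up extra integer column), written with a plain loop since newer pandas deprecates `groupby(..., axis=1)`; the shorter dtype group is padded with NaN:

```python
import pandas as pd

d = {'col1': [1, 2], 'col2': ['jack', 'bill'], 'col3': [4, 5],
     'col4': ['megan', 'sarah'], 'col5': [6, 7]}   # col5: extra int column
df = pd.DataFrame(data=d)

# Rows of unequal length are padded with NaN by the default constructor.
out = pd.DataFrame(
    [df.loc[:, df.dtypes == dt].values.ravel().tolist()
     for dt in df.dtypes.unique()]).T
print(out)
```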
rafaelc