
If I have a dataframe with only two datatypes like below:

import pandas as pd

d = {'col1': [1, 2], 'col2': ['jack', 'bill'], 'col3': [4, 5], 'col4': ['megan', 'sarah']}
df = pd.DataFrame(data=d)
print(df)


   col1  col2  col3   col4
0     1  jack     4  megan
1     2  bill     5  sarah


print(df.dtypes)

col1     int64
col2    object
col3     int64
col4    object

Is there a way to stack these columns based only on data type? The end result would be:

   col1  col2
0     1  jack
1     2  bill
2     4  megan
3     5  sarah

It's not necessary for the final column names to remain the same.
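A sketch of the desired reshape, selecting each dtype group with `select_dtypes` and stacking its columns end to end:

```python
import pandas as pd

d = {'col1': [1, 2], 'col2': ['jack', 'bill'],
     'col3': [4, 5], 'col4': ['megan', 'sarah']}
df = pd.DataFrame(data=d)

# For each distinct dtype, take its columns, transpose, and stack so the
# columns are concatenated one after another (col1 then col3, col2 then col4).
parts = [df.select_dtypes(include=[dt]).T.stack().reset_index(drop=True)
         for dt in df.dtypes.unique()]
out = pd.concat(parts, axis=1)
print(out)
```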

Chris Macaluso

3 Answers


This works with your sample data; I'm not sure whether it works with general data:

(df.groupby(df.dtypes, axis=1)
   .apply(lambda x: (x.stack().reset_index(drop=True)))
)

Output

   int64 object
0      1   jack
1      4  megan
2      2   bill
3      5  sarah
Quang Hoang
  • Just looking into this a little more. The two datatypes I'm using in my real dataset are actually datetime and float. I first needed to change `axis=1` to `axis='columns'` for my dataframe, but now I'm receiving `TypeError: unhashable type: 'list'`, which is interesting because there are no lists inside the dataframe – Chris Macaluso Aug 01 '19 at 15:18
  • I wonder what'd happen if there is a mismatch of # of columns for each dtype ;) – rafaelc Aug 01 '19 at 15:26
  • @rafaelc that's what I meant when I said it might not work in general. – Quang Hoang Aug 01 '19 at 15:29
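As the comment thread suggests, `groupby(..., axis=1)` is fragile (and deprecated in recent pandas versions). A sketch of an equivalent reshape that avoids it, stacking each dtype group selected with `select_dtypes`; it reproduces the row-major order of the output above:

```python
import pandas as pd

d = {'col1': [1, 2], 'col2': ['jack', 'bill'],
     'col3': [4, 5], 'col4': ['megan', 'sarah']}
df = pd.DataFrame(data=d)

# Stack each dtype group separately and line the results up side by side;
# str(dt) gives the 'int64' / 'object' column labels seen in the answer.
out = pd.concat(
    {str(dt): df.select_dtypes(include=[dt]).stack().reset_index(drop=True)
     for dt in df.dtypes.unique()},
    axis=1)
print(out)
```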

Why not give the for loop a chance:

pd.DataFrame([ df.loc[:,df.dtypes==x].values.ravel() for x in df.dtypes.unique()]).T
Out[46]: 
   0      1
0  1   jack
1  4  megan
2  2   bill
3  5  sarah
BENY
  • A good solution. However because of the loop, processing time for yours scales poorly with larger datasets compared to the other solution posted – G. Anderson Aug 01 '19 at 15:18
  • Also, the real data set I'm working with has a datetime column for one of the dtypes. This performed some strange formatting on that column and condensed the rows into a single line for every day, kind of strange behavior. I haven't been able to dig into why yet – Chris Macaluso Aug 01 '19 at 15:20
  • @G.Anderson ummm, you sure? https://stackoverflow.com/questions/54028199/are-for-loops-in-pandas-really-bad-when-should-i-care/54028200#54028200 – BENY Aug 01 '19 at 15:20
  • I must have been mistaken about the cause, but the result is that it scales linearly with the number of rows, while the other solution doesn't. On 220 rows: `36.9 ms`. On 2200 rows: `315 ms`. On 22000 rows: `3.02 s` – G. Anderson Aug 01 '19 at 15:41
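On the datetime issue raised above: the one-liner pushes every dtype through a single mixed DataFrame, which coerces the values to object. A sketch of a variant that builds one Series per dtype instead, so each column keeps its own dtype (output order matches the answer):

```python
import pandas as pd

d = {'col1': [1, 2], 'col2': ['jack', 'bill'],
     'col3': [4, 5], 'col4': ['megan', 'sarah']}
df = pd.DataFrame(data=d)

# One Series per dtype instead of one mixed 2-D block, so ints stay int64,
# datetimes would stay datetime64, and so on.
out = pd.concat(
    [pd.Series(df.loc[:, df.dtypes == x].values.ravel())
     for x in df.dtypes.unique()],
    axis=1)
print(out)
print(out.dtypes)
```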

For a mismatch in the number of columns per dtype, you may use the default constructor, which pads shorter rows with NaN. Borrowing Quang's idea of groupby(axis=1):

pd.DataFrame(df.groupby(df.dtypes, axis=1).apply(lambda s: list(s.values.ravel())).tolist()).T
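To see the padding in action, a sketch with an uneven split (`col5` is a made-up extra integer column), written with a plain loop since newer pandas deprecates `groupby(..., axis=1)`; the shorter dtype group is padded with NaN:

```python
import pandas as pd

d = {'col1': [1, 2], 'col2': ['jack', 'bill'], 'col3': [4, 5],
     'col4': ['megan', 'sarah'], 'col5': [6, 7]}   # col5: extra int column
df = pd.DataFrame(data=d)

# Rows of unequal length are padded with NaN by the default constructor.
out = pd.DataFrame(
    [df.loc[:, df.dtypes == dt].values.ravel().tolist()
     for dt in df.dtypes.unique()]).T
print(out)
```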
rafaelc