
I have a pandas DataFrame full of tuples (it would be the same with arrays), and I would like to split each column into several columns (each array or tuple has the same length). Let's take this as an example:

df=pd.DataFrame([[(1,2),(3,4)],[(5,6),(7,8)]], columns=['column0', 'column1'])

which outputs:

    column0 column1  
0   (1, 2)   (3, 4)  
1   (5, 6)   (7, 8)  

I tried to build on this solution (https://stackoverflow.com/a/16245109/4218755) using variations of the expression:

df.textcol.apply(lambda s: pd.Series({'feature1':s+1, 'feature2':s-1}))

like

df.column0.apply(lambda s: pd.Series({'feature1':s[0], 'feature2':s[1]}))

which outputs:

   feature1  feature2
0         1         2
1         5         6

This is the desired behavior, and it works well. But if I try to use

 df2=df[df.columns].apply(lambda s: pd.Series({'feature1':s[0], 'feature2':s[1]}))

then df2 is:

         column0 column1
feature1  (1, 2)  (3, 4)
feature2  (5, 6)  (7, 8)

which is obviously wrong. Applying directly to df doesn't help either; it outputs the same result as df2.

How can I apply such a splitting technique to a whole DataFrame, and are there alternatives? Thanks
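For reference, a minimal sketch (reusing the example df above) of why the whole-DataFrame apply misbehaves: apply hands the lambda each column as a Series s, so s[0] indexes rows of the column, not the inside of a tuple:

```python
import pandas as pd

df = pd.DataFrame([[(1, 2), (3, 4)], [(5, 6), (7, 8)]],
                  columns=['column0', 'column1'])

# Here s is a whole column, so s[0] is the first row's tuple of that column,
# not the first element of a tuple -- hence the "transposed tuples" result.
df2 = df.apply(lambda s: pd.Series({'feature1': s[0], 'feature2': s[1]}))
print(df2)
#          column0 column1
# feature1  (1, 2)  (3, 4)
# feature2  (5, 6)  (7, 8)
```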

Ando Jurai
  • I am approaching the solution with : df2=df.applymap(lambda s: pd.Series({'feat1':s[0],'feat2': s[1]})). It outputs a DataFrame whose cells are themselves Series (e.g. column0, row 0 holds feat1 1 / feat2 2, dtype: int64), but I am stuck with this index (and df2.reset_index is not working) – Ando Jurai Jul 05 '16 at 10:21
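The nesting in the comment above happens because applymap expects one scalar back per cell; a sketch of the per-feature workaround (one applymap pass per tuple position, then concatenate — the route the asker later describes; note that newer pandas renames applymap to DataFrame.map):

```python
import pandas as pd

df = pd.DataFrame([[(1, 2), (3, 4)], [(5, 6), (7, 8)]],
                  columns=['column0', 'column1'])

# One pass per tuple position, each returning a plain scalar per cell
first = df.applymap(lambda t: t[0]).add_suffix('_feature1')
second = df.applymap(lambda t: t[1]).add_suffix('_feature2')
out = pd.concat([first, second], axis=1)
print(out)
#    column0_feature1  column1_feature1  column0_feature2  column1_feature2
# 0                 1                 3                 2                 4
# 1                 5                 7                 6                 8
```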

3 Answers


IIUC you can use:

df=pd.DataFrame([[(1,2),(3,4)],[(5,6),(7,8)]], columns=['column0', 'column1'])
print (df)
  column0 column1
0  (1, 2)  (3, 4)
1  (5, 6)  (7, 8)


for col in df.columns: 
    df[col]=df[col].apply(lambda s: pd.Series({'feature1':s[0], 'feature2':s[1]}))

print (df)
   column0  column1
0        1        3
1        5        7
jezrael
  • Thanks for the idea. Actually I would expect to have 2,6 in a second column and 4,8 in a fourth, too. I don't understand why your code doesn't output this; I expected it from reading the code. I also did not specify it, but I would like to avoid loops if possible (I still take the idea as valuable, though, if no alternative is available). – Ando Jurai Jul 05 '16 at 10:26
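A sketch (not from the answer itself) that keeps both features of every column, per the comment above: expand each column of tuples with apply(pd.Series) and concatenate the pieces side by side:

```python
import pandas as pd

df = pd.DataFrame([[(1, 2), (3, 4)], [(5, 6), (7, 8)]],
                  columns=['column0', 'column1'])

# apply(pd.Series) turns a column of tuples into a 2-column sub-DataFrame;
# add_prefix keeps the origin column visible in the new column names
parts = [df[col].apply(pd.Series).add_prefix(col + '_') for col in df.columns]
result = pd.concat(parts, axis=1)
print(result)
#    column0_0  column0_1  column1_0  column1_1
# 0          1          2          3          4
# 1          5          6          7          8
```

There is still a Python-level loop over columns (and apply loops over rows), but no feature is dropped.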

You could extract the DataFrame values as a NumPy array, use IT.chain.from_iterable to extract the ints from the tuples, and then reshape and rebuild the array into a new DataFrame:

import itertools as IT
import numpy as np
import pandas as pd
df = pd.DataFrame([[(1,2),(3,4)],[(5,6),(7,8)]], columns=['column0', 'column1'])
arr = df.values
arr = np.array(list(IT.chain.from_iterable(arr))).reshape(len(df), -1)
result = pd.DataFrame(arr)

yields

   0  1  2  3
0  1  2  3  4
1  5  6  7  8

By the way, you might have fallen into an XY-trap -- you're asking for X when you really should be looking for Y. Instead of trying to transform df into result, it might be easier to build the desired DataFrame, result, from the original data source.

For example, if your original data is a list of lists of tuples:

data = [[(1,2),(3,4)],[(5,6),(7,8)]]

Then the desired DataFrame could be built using

df = pd.DataFrame(np.array(data).reshape(2,-1))
#    0  1  2  3
# 0  1  2  3  4
# 1  5  6  7  8

Once you have non-NumPy-native data types in your DataFrame (such as tuples), you are doomed to using at least one Python loop to extract the ints from the tuples. (I'm regarding things like df.apply(func) and list(IT.chain.from_iterable(arr)) as essentially Python loops since they work at Python-loop speed.)
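Under that caveat, the same chain-and-reshape idea can be written a little more compactly with tolist — still a Python-level loop underneath, and it assumes every tuple has the same length:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[(1, 2), (3, 4)], [(5, 6), (7, 8)]],
                  columns=['column0', 'column1'])

# values.tolist() gives nested lists of tuples; np.array stacks them into
# shape (rows, cols, 2), which then reshapes to the flat (rows, 4) layout
arr = np.array(df.values.tolist()).reshape(len(df), -1)
flat = pd.DataFrame(arr)
print(flat)
#    0  1  2  3
# 0  1  2  3  4
# 1  5  6  7  8
```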

unutbu
  • Thanks for your help! Actually I believed that applymap and apply performed some kind of vectorized operation; that's why I preferred to start from the tuples df instead of reusing the original df it was made from (which was pretty simple to convert). Finally I had used something along the lines of making 2 copies, then using splitframe1.applymap(lambda x: x[0]) and splitframe2.applymap(lambda x: x[1]) and merging after renaming columns. In any case, your solution and additional info are worth accepting your answer as the best. – Ando Jurai Jul 05 '16 at 12:10

You may iterate over each column you want to split and assign the new columns to your DataFrame:

import pandas as pd

df=pd.DataFrame( [ [ (1,2), (3,4)],
                   [ (5,6), (7,8)] ], columns=['column0', 'column1'])

# empty DataFrame
df2 = pd.DataFrame()

for col in df.columns:
    # names of new columns
    feature_columns  = [ "{col}_feature1".format(col=col), "{col}_feature2".format(col=col) ]
    # split current column
    df2[ feature_columns ] = df[ col ].apply(lambda s: pd.Series({ feature_columns[0]: s[0],
                                                                   feature_columns[1]: s[1]} ) )

print(df2)

which gives

   column0_feature1  column0_feature2  column1_feature1  column1_feature2
0                 1                 2                 3                 4
1                 5                 6                 7                 8
desiato
  • Very clever, thanks. It won't be the accepted answer because the itertools one looks better at avoiding loops, but it seems valuable. – Ando Jurai Jul 05 '16 at 11:49