4

I have a dataframe with several columns that look like this:

              Age                         G                      GS 
INDEX1  [27, 25, 22, 30, 30]    [76, 79, 80, 76, 77]    [76, 79, 80, 76, 77]    
INDEX2  [24, 23, 21, 32, 34]    [77, 76, 81, 75, 77]    [77, 76, 81, 75, 77]    

How do I go about splitting up all the lists into their own separate columns? Ideally my data frame will look like this once I have finished:

       Age   Age1  Age2   Age3   Age4   G    G1   G2   G3   G4
INDEX1  27    25    22     30     30    76   79   80   76   77  ...
... 

If it helps I did convert a dictionary to this data frame. I have tried searching and implementing several different similar solutions on stack but none of them seem to work. This solution converts properly but for some reason creates two NaN columns. If anyone know how to perform this on the entire dataframe, I can drop the extra NaN columns:

df1 = pd.DataFrame(converted['Age'].values.tolist())
df1


    0   1   2   3    4       5   6
0   27  25  22  30  30.0    NaN NaN
1   31  29  33  27  33.0    NaN NaN
2   22  21  26  21  33.0    NaN NaN
3   29  24  31  33  27.0    NaN NaN
4   30  21  31  31  32.0    NaN NaN
... ... ... ... ... ... ... ...
1727    28  27  28  20  26.0    NaN NaN
1728    20  29  27  24  20.0    NaN NaN
1729    30  31  34  25  26.0    NaN NaN
1730    31  26  34  21  21.0    NaN NaN
1731    22  24  20  28  25.0    NaN NaN

There are a few other solutions I have tried but errors arose with the Age column, it may have something to do with hidden values but I am not sure.

df2 = pd.DataFrame()

for col in converted.columns:
    # names of new columns
    feature_columns  = [ "{col}_feature1".format(col=col), "{col}_feature2".format(col=col), "{col}_feature3".format(col=col)
                       , "{col}_feature4".format(col=col)
                       , "{col}_feature5".format(col=col)]
    # split current column
    df2[ feature_columns ] = df[ col ].apply(lambda s: pd.Series({ feature_columns[0]: s[0],
                                                                   feature_columns[1]: s[1],
                                                                   feature_columns[2]: s[2],
                                                                   feature_columns[3]: s[3],
                                                                   feature_columns[4]: s[4]} ) )

print (df2)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2896             try:
-> 2897                 return self._engine.get_loc(key)
   2898             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type()

KeyError: 'Age'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-180-53ed0043f9d8> in <module>
      7                        , "{col}_feature5".format(col=col)]
      8     # split current column
----> 9     df2[ feature_columns ] = df[ col ].apply(lambda s: pd.Series({ feature_columns[0]: s[0],
     10                                                                    feature_columns[1]: s[1],
     11                                                                    feature_columns[2]: s[2],

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2978             if self.columns.nlevels > 1:
   2979                 return self._getitem_multilevel(key)
-> 2980             indexer = self.columns.get_loc(key)
   2981             if is_integer(indexer):
   2982                 indexer = [indexer]

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
    377             except ValueError:
    378                 raise KeyError(key)
--> 379         return super().get_loc(key, method=method, tolerance=tolerance)
    380 
    381     @Appender(_index_shared_docs["get_indexer"])

/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2897                 return self._engine.get_loc(key)
   2898             except KeyError:
-> 2899                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2900         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2901         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type()

KeyError: 'Age'

Edit: I tried using solution listed here: Pandas split column of lists into multiple columns

And it didn't work for me. Thank you for the suggestion!

Neel
  • 75
  • 4
  • @G.Anderson Thank you for the response and welcome!! I tried implementing explode, but it puts each value in a separate row. This is actually where I started from, and I am trying to align my data horizontally. Is it possible to explode into new columns? – Neel Feb 06 '20 at 16:47
  • Apologies, I misunderstood the request – G. Anderson Feb 06 '20 at 16:50

1 Answers1

5

Use:

new_df = pd.concat([pd.DataFrame(col.tolist(), index = df.index).add_prefix(i) 
                    for i, col in df.items()], axis = 1)
print(new_df)
        Age0  Age1  Age2  Age3  Age4  G0  G1  G2  G3  G4  GS0  GS1  GS2  GS3  \
INDEX1    27    25    22    30    30  76  79  80  76  77   76   79   80   76   
INDEX2    24    23    21    32    34  77  76  81  75  77   77   76   81   75   

        GS4  
INDEX1   77  
INDEX2   77  

Maybe better only set the index one time

new_df = pd.concat([pd.DataFrame(col.tolist()).add_prefix(i) 
                    for i, col in df.items()], axis = 1)
new_df.index = df.index
ansev
  • 30,322
  • 5
  • 17
  • 31
  • 1
    This is EXACTLY what I have been looking for. Can't believe it was that simple!! THANK YOU so much!! – Neel Feb 06 '20 at 16:52