I have a weird problem. I have the following dataframe:

embedding
0   [0.0, 0.0, 0.0, 0.6223578453063965, 0.0, 0.270...
1   [0.0, 0.0, 0.0, 0.6223578453063965, 0.0, 0.270...
2   [0.0, 0.0, 0.0, 0.6223578453063965, 0.0, 0.270...

It's a dataframe with one column named embedding. Each row holds an array of about 100 items, and the arrays are the same size in every row.

How can I expand it so that each item in the array becomes its own column in a dataframe? Is that possible, or do I have to extract the numpy array and create a dataframe from the nested array?

Update: I don't have names for all columns. It's not important to me. What is important is that the order be preserved from the numpy array.

Update2: as per comment -

print(Xtest_e1.head(2).to_dict())
{'embedding': {0: array([0.        , 0.        , 0.        , 0.62235785, 0.        ,
       0.27049118, 0.        , 0.31094068, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.4330532 , 0.        ,
       0.        , 0.25157961, 0.        , 0.        , 0.        ,
       0.40683705, 0.01569915, 0.        , 0.        , 0.        ,
       0.13090582, 0.        , 0.49955425, 0.06970194, 0.29155406,
       0.        , 0.        , 0.27342197, 0.        , 0.        ,
       0.        , 0.04415211, 0.        , 0.03908829, 0.        ,
       0.07673171, 0.33199945, 0.        , 0.51759815, 0.        ,
       0.47191489, 0.45380819, 0.13475986, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.08000553,
       0.        , 0.02991109, 0.        , 0.50515431, 0.        ,
       0.24663273, 0.        , 0.50839704, 0.        , 0.        ,
       0.05281948, 0.44884402, 0.        , 0.44542992, 0.15376966,
       0.        , 0.        , 0.        , 0.39128256, 0.49497205,
       0.        , 0.        ]), 1: array([0.        , 0.        , 0.        , 0.62235785, 0.        ,
       0.27049118, 0.        , 0.31094068, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.4330532 , 0.        ,
       0.        , 0.25157961, 0.        , 0.        , 0.        ,
       0.40683705, 0.01569915, 0.        , 0.        , 0.        ,
       0.13090582, 0.        , 0.49955425, 0.06970194, 0.29155406,
       0.        , 0.        , 0.27342197, 0.        , 0.        ,
       0.        , 0.04415211, 0.        , 0.03908829, 0.        ,
       0.07673171, 0.33199945, 0.        , 0.51759815, 0.        ,
       0.47191489, 0.45380819, 0.13475986, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.08000553,
       0.        , 0.02991109, 0.        , 0.50515431, 0.        ,
       0.24663273, 0.        , 0.50839704, 0.        , 0.        ,
       0.05281948, 0.44884402, 0.        , 0.44542992, 0.15376966,
       0.        , 0.        , 0.        , 0.39128256, 0.49497205,
       0.        , 0.        ])}}
Lostsoul
  • Duplicate of https://stackoverflow.com/questions/35491274/pandas-split-column-of-lists-into-multiple-columns ? – Nick ODell Jun 09 '21 at 21:47
  • @NickODell But doesn't that solution require me to know the column names in advance? I don't care what the columns are called, and I don't want to name each one by hand. They can be column1, column2, etc., since only the order matters. – Lostsoul Jun 09 '21 at 21:48
  • If you need to generate names for the columns, you could use a list comprehension like `['column%d' % i for i in range(100)]`. – Nick ODell Jun 09 '21 at 21:52
  • can you add in your dataframe as a dict? just the first 1-2 rows `print(df.head(2).to_dict())` – Umar.H Jun 09 '21 at 21:54
  • @Umar.H Done. Let me know if you need any other info. – Lostsoul Jun 09 '21 at 22:03
  • awesome, the output is still a little unclear to me, but can you try `s = df.stack().explode().reset_index(1)`;`s['level_1'] = s['level_1'] + s.groupby(level=0).cumcount().astype(str)`;`s.set_index('level_1',append=True).unstack(1)`? – Umar.H Jun 09 '21 at 22:10
  • @Umar.H I think that did it. I'm testing now. The outcome looks like what I wanted so far. – Lostsoul Jun 09 '21 at 22:18

1 Answer

Is this what you expect?

>>> pd.DataFrame(Xtest_e1["embedding"].tolist()).add_prefix("c")

    c0   c1   c2        c3   c4  ...  c72       c73       c74  c75  c76
0  0.0  0.0  0.0  0.622358  0.0  ...  0.0  0.391283  0.494972  0.0  0.0
1  0.0  0.0  0.0  0.622358  0.0  ...  0.0  0.391283  0.494972  0.0  0.0

[2 rows x 77 columns]
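
A self-contained sketch of the same idea, for anyone who wants to try it without the original data. The small `Xtest_e1` built here is a hypothetical stand-in (three values per row instead of 77), and `np.vstack` is swapped in as one alternative way to build the 2-D array; it assumes every row holds an array of the same length.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Xtest_e1: one object column of equal-length arrays.
Xtest_e1 = pd.DataFrame(
    {"embedding": [np.array([0.0, 0.62, 0.27]), np.array([0.1, 0.0, 0.31])]}
)

# Stack the per-row arrays into one 2-D array.
# Column order follows the order of the items in each array.
matrix = np.vstack(Xtest_e1["embedding"].tolist())

# Reuse the original index so the result can be joined back to other columns.
expanded = pd.DataFrame(matrix, index=Xtest_e1.index).add_prefix("c")
print(expanded)
```

Passing `index=Xtest_e1.index` only matters when the original index is not the default 0..n-1 range, for example after filtering or a train/test split.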
Corralien
  • I get a weird error - TypeError: 'NoneType' object is not iterable – Lostsoul Jun 09 '21 at 22:36
  • The shape is (43206, 1), and the data is what I posted above. The column type is 'object'. Does that make a difference? Xtest_e1.columns results in Index(['embedding'], dtype='object') – Lostsoul Jun 09 '21 at 22:42
  • pd.DataFrame(Xtest_e1["embedding"].to_dict()).add_prefix("c") – Lostsoul Jun 10 '21 at 00:32
  • Are you sure `embedding` column is a list? This snippet of code seems to work with your sample. – Corralien Jun 10 '21 at 04:35
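
The thread does not establish what caused the `TypeError`. One possible cause, offered here only as an assumption, is that some of the 43206 rows hold `None` rather than an array. A minimal sketch of how that could be checked and worked around, using a hypothetical three-row frame `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which one row holds None instead of an array.
df = pd.DataFrame(
    {"embedding": [np.array([0.0, 0.5]), None, np.array([0.2, 0.0])]}
)

# Flag rows whose value is a list or NumPy array.
is_sequence = df["embedding"].apply(lambda v: isinstance(v, (list, np.ndarray)))

# Index labels of the rows that cannot be expanded.
bad_rows = df.index[~is_sequence.to_numpy()].tolist()
print(bad_rows)

# Expand only the valid rows, keeping their original index labels.
valid = df.loc[is_sequence, "embedding"]
expanded = pd.DataFrame(valid.tolist(), index=valid.index).add_prefix("c")
print(expanded)
```

If `bad_rows` comes back empty on the real data, the cause lies elsewhere and this check rules out only the `None` case.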