
I need to convert a pandas DataFrame into a list of arrays, where each array is a list of single-element lists.

Sample:

import pandas as pd
df = pd.DataFrame([[1,2,3],[2,2,4],[3,2,4]])

I know there is the as_matrix() function, which returns the following:

df.as_matrix()
# result: array([[1, 2, 3],
#                [2, 2, 4],
#                [3, 2, 4]])

However, I require something in this format:

  [array([[1], [2], [3]]),
   array([[2], [2], [4]]),
   array([[3], [2], [4]])]

That is, I need a list of arrays, one per row of the DataFrame, where each array is a list of lists and each innermost list contains a single element. In effect, this turns each row of the DataFrame into a column vector of dimension 3.

This is especially useful when I need to do matrix / vector operations in NumPy. My data source is currently in .csv format, and I am struggling to find a way to convert a DataFrame into vectors.

halfer
SeekingAlpha

2 Answers


Extract the underlying array data, add a new axis as the last dimension, and then split along the first axis with np.vsplit -

np.vsplit(df.values[...,None],df.shape[0])

Sample run -

In [327]: df
Out[327]: 
   0  1  2
0  1  2  3
1  2  2  4
2  3  2  4

In [328]: expected_output = [np.array([[1], [2], [3]]),
     ...: np.array([[2], [2], [4]]),
     ...: np.array([[3], [2], [4]])]

In [329]: expected_output
Out[329]: 
[array([[1],
        [2],
        [3]]), array([[2],
        [2],
        [4]]), array([[3],
        [2],
        [4]])]

In [330]: np.vsplit(df.values[...,None],df.shape[0])
Out[330]: 
[array([[[1],
         [2],
         [3]]]), array([[[2],
         [2],
         [4]]]), array([[[3],
         [2],
         [4]]])]

If you are working with NumPy functions, then in most scenarios you should be able to do away with the splitting and use the extended array directly.
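As a rough sketch of that point: np.matmul (the @ operator) broadcasts over leading axes, so the extended (3, 3, 1) array can be fed to a matrix operation in one call, without splitting it first. The matrix M below is hypothetical and not part of the question; it is chosen only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 2, 4], [3, 2, 4]])

# Shape (3, 3, 1): one (3, 1) column vector per DataFrame row.
vectors = df.values[..., None]

# Hypothetical matrix, used only to illustrate the idea.
M = np.array([[1, 0, 0],
              [0, 2, 0],
              [0, 0, 3]])

# matmul broadcasts M over the leading axis, so every column vector
# is multiplied by M in a single call.
transformed = M @ vectors  # shape (3, 3, 1)

# Equivalent result computed piece by piece, for comparison.
looped = [M @ v for v in vectors]
```

Here transformed[k] holds the same values as looped[k], so the list of arrays is only needed if downstream code specifically expects a Python list.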

Now, under the hood, np.vsplit makes use of np.array_split, and that's basically a loop. So, a slightly more performant way would be to avoid the extra function overhead and call it directly, like so -

np.array_split(df.values[...,None],df.shape[0])

Note that each resulting array would have one more dimension than listed in the expected output. If you want a squeezed version, we could use a list comprehension on the new-axis extended array, like so -

In [357]: [i for i in df.values[...,None]]
Out[357]: 
[array([[1],
        [2],
        [3]]), array([[2],
        [2],
        [4]]), array([[3],
        [2],
        [4]])]

Thus, another way would be to add the new axis inside the loop -

[i[...,None] for i in df.values]
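To make the shape difference between these variants concrete, here is a small check, assuming the same sample DataFrame as in the question. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 2, 4], [3, 2, 4]])
extended = df.values[..., None]  # shape (3, 3, 1)

# vsplit keeps the leading axis, so each piece has shape (1, 3, 1).
split = np.vsplit(extended, df.shape[0])

# Iterating over the first axis drops it, so each piece has shape (3, 1).
by_row = [i for i in extended]
in_loop = [i[..., None] for i in df.values]

print(split[0].shape, by_row[0].shape, in_loop[0].shape)
# (1, 3, 1) (3, 1) (3, 1)

# Indexing away the leading axis of a vsplit piece gives the same (3, 1) array.
squeezed = [s[0] for s in split]
```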
Divakar

First convert your DataFrame to a matrix. Then add a dimension and convert it to a list.

Try:

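import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,3],[2,2,4],[3,2,4]])
my_matrix = df.to_numpy()  # as_matrix() was removed in pandas 1.0; df.values also works
my_list_of_arrays_of_list_lists = list(np.expand_dims(my_matrix, axis=2))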

my_list_of_arrays_of_list_lists represents what you are looking for and gives you:

Out[42]: [array([[1],[2],[3]]),
          array([[2],[2],[4]]),
          array([[3],[2],[4]])]
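For a 2-D array, np.expand_dims(..., axis=2) and indexing with [..., None] add the same trailing axis. The sketch below checks both against the format requested in the question. It assumes pandas 0.24 or later, where DataFrame.to_numpy() is available; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [2, 2, 4], [3, 2, 4]])

# Assumption: pandas >= 0.24, where DataFrame.to_numpy() exists.
arr = df.to_numpy()

# Two ways of adding the trailing axis, then splitting into a list by row.
via_expand = list(np.expand_dims(arr, axis=2))
via_index = list(arr[..., None])

# The format requested in the question.
expected = [np.array([[1], [2], [3]]),
            np.array([[2], [2], [4]]),
            np.array([[3], [2], [4]])]

matches = all(
    np.array_equal(a, b) and np.array_equal(a, e)
    for a, b, e in zip(via_expand, via_index, expected)
)
print(matches)  # True
```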
Franz