I have a 2D NumPy array that I would like to put in a pandas Series (not a DataFrame):

>>> import pandas as pd
>>> import numpy as np
>>> a = np.zeros((5, 2))
>>> a
array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.]])

But this throws an error:

>>> s = pd.Series(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/miniconda/envs/pyspark/lib/python3.4/site-packages/pandas/core/series.py", line 227, in __init__
    raise_cast_failure=True)
  File "/miniconda/envs/pyspark/lib/python3.4/site-packages/pandas/core/series.py", line 2920, in _sanitize_array
    raise Exception('Data must be 1-dimensional')
Exception: Data must be 1-dimensional

It is possible with a hack:

>>> s = pd.Series(map(lambda x:[x], a)).apply(lambda x:x[0])
>>> s
0    [0.0, 0.0]
1    [0.0, 0.0]
2    [0.0, 0.0]
3    [0.0, 0.0]
4    [0.0, 0.0]

Is there a better way?

zemekeneng
  • By default, pandas reads the shape of the NumPy array and allocates a DataFrame accordingly, so you need to disguise the shape of your array. That is what your "hack" does, albeit one row at a time. – Kartik Aug 09 '16 at 00:30
  • Any thoughts on how to wrap each row in a list using a matrix operation? – zemekeneng Aug 09 '16 at 00:37
  • Just out of curiosity, why would you want this? – juanpa.arrivillaga Aug 09 '16 at 02:18
  • You might want to try tuples: x:y with a separator of ":". But be warned that NumPy will default to 'object' calc mode instead of "C" calc mode when it sees objects in the matrix. – Merlin Aug 09 '16 at 02:50
  • @juanpa.arrivillaga Machine learning. I would like to append a vectorized corpus of texts to the DataFrame that holds the labels and other features. This makes it easier to filter the whole dataset, and is particularly handy with smaller datasets whose subsets may not have a complete set of labels. I want all the columns in a single Series because I don't want to keep track of a column for every vocabulary term in a DataFrame. If you have a better system to manage this, I would love to hear about it! – zemekeneng Aug 09 '16 at 03:09

2 Answers

Well, you can use the numpy.ndarray.tolist method, like so:

>>> a = np.zeros((5,2))
>>> a
array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.]])
>>> a.tolist()
[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
>>> pd.Series(a.tolist())
0    [0.0, 0.0]
1    [0.0, 0.0]
2    [0.0, 0.0]
3    [0.0, 0.0]
4    [0.0, 0.0]
dtype: object
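
If the end goal is a column that sits next to other features (the use case described in the question's comments), a minimal sketch might look like the following. The frame `df` and its `label` column are hypothetical stand-ins, not something from the original post, and the round trip back to a 2D array assumes every row has the same length.

```python
import numpy as np
import pandas as pd

a = np.zeros((5, 2))

# Hypothetical frame standing in for the table of labels and other features.
df = pd.DataFrame({"label": list("abcde")})

# Store one Python list per row in a single object-dtype column.
df["vector"] = pd.Series(a.tolist(), index=df.index)

# Rebuild the 2D array from the column when it is needed again.
restored = np.array(df["vector"].tolist())
print(restored.shape)
```

Filtering `df` by label then keeps each vector attached to its row, at the cost of an object-dtype column that NumPy cannot operate on directly.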

EDIT:

A faster way to get a similar result is simply pd.Series(list(a)). This makes a Series of NumPy arrays instead of Python lists, so it should be faster than a.tolist(), which returns a list of Python lists.
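
To make the difference between the two constructions concrete, here is a small sketch that only inspects what each Series holds. The variable names `s_arrays` and `s_lists` are mine, chosen for illustration.

```python
import numpy as np
import pandas as pd

a = np.zeros((5, 2))

s_arrays = pd.Series(list(a))    # each element is a 1D NumPy array (one row of a)
s_lists = pd.Series(a.tolist())  # each element is a plain Python list of floats

# Both are object-dtype Series of length 5; only the element type differs.
print(type(s_arrays.iloc[0]))
print(type(s_lists.iloc[0]))
```

Which one is preferable probably depends on what you do with the elements afterwards: arrays keep NumPy operations available per row, while lists are plain Python objects.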

bpachev

pd.Series(list(a))

is consistently slower than

pd.Series(a.tolist())

Tested with array sizes from 500,000 to 20,000,000 rows.

a = np.ones((500000,2))

showing only 1,000,000 rows:

%timeit pd.Series(list(a))
1 loop, best of 3: 301 ms per loop

%timeit pd.Series(a.tolist())
1 loop, best of 3: 261 ms per loop
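
For anyone who wants to reproduce this outside IPython, here is a rough sketch using the standard-library timeit module in place of the %timeit magic. The helper name `best_time` and the repeat count are my own choices. Timings vary by machine and by NumPy/pandas version, so no expected numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

a = np.ones((500000, 2))


def best_time(fn, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` single runs."""
    return min(timeit.repeat(fn, number=1, repeat=repeat))


t_list = best_time(lambda: pd.Series(list(a)))
t_tolist = best_time(lambda: pd.Series(a.tolist()))

print("pd.Series(list(a)):    %.0f ms" % (t_list * 1000))
print("pd.Series(a.tolist()): %.0f ms" % (t_tolist * 1000))
```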
Merlin