Put a 2d Array into a Pandas Series

Question

I have a 2D Numpy array that I would like to put in a pandas Series (not a DataFrame):

>>> import pandas as pd
>>> import numpy as np
>>> a = np.zeros((5, 2))
>>> a
array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.]])

But this throws an error:

>>> s = pd.Series(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/miniconda/envs/pyspark/lib/python3.4/site-packages/pandas/core/series.py", line 227, in __init__
    raise_cast_failure=True)
  File "/miniconda/envs/pyspark/lib/python3.4/site-packages/pandas/core/series.py", line 2920, in _sanitize_array
    raise Exception('Data must be 1-dimensional')
Exception: Data must be 1-dimensional

It is possible with a hack:

>>> s = pd.Series(map(lambda x:[x], a)).apply(lambda x:x[0])
>>> s
0    [0.0, 0.0]
1    [0.0, 0.0]
2    [0.0, 0.0]
3    [0.0, 0.0]
4    [0.0, 0.0]

Is there a better way?

By default Pandas gets the shape of the np array and allocates DataFrame accordingly. So you need to fool the shape of your np array... Which is what your "hack" does, albeit one row at a time. – Kartik Aug 09 '16 at 00:30
Any thoughts on how to wrap each row in a list using a matrix operation? – zemekeneng Aug 09 '16 at 00:37
You might want to try tuples, x:y with a separator of ":". But be warned that numpy will default to 'object' calc mode vs "C" calc when it sees objects in the matrix. – Merlin Aug 09 '16 at 02:50
@juanpa.arrivillaga Machine learning. I would like to append a vectorized corpus of texts to the DataFrame that holds the labels and other features. This way it is easier to filter the whole dataset, and it is particularly handy with smaller datasets whose subsets may not have a complete set of labels. I want all the columns in a Series because I don't want to keep track of a column for every vocabulary term in a DataFrame. If you have a better system to manage this, I would love to hear about it! – zemekeneng Aug 09 '16 at 03:09

bpachev · Accepted Answer · 2016-08-09T03:36:25.973

27

Well, you can use the numpy.ndarray.tolist function, like so:

>>> a = np.zeros((5,2))
>>> a
array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.]])
>>> a.tolist()
[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
>>> pd.Series(a.tolist())
0    [0.0, 0.0]
1    [0.0, 0.0]
2    [0.0, 0.0]
3    [0.0, 0.0]
4    [0.0, 0.0]
dtype: object
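
If you later need the 2D array back, one option is to rebuild it from the rows. This is only a sketch: the array values and the name restored are made up for illustration, and it assumes every row has the same length.

```python
import numpy as np
import pandas as pd

# Illustrative data: 5 rows, 2 columns.
a = np.arange(10, dtype=float).reshape(5, 2)

# Object-dtype Series holding one Python list per row.
s = pd.Series(a.tolist())

# Rebuild the 2D array from the Series of rows.
# This relies on all rows having equal length.
restored = np.array(s.tolist())
print(restored.shape)
```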

EDIT:

A faster way to get a similar result is pd.Series(list(a)). This makes a Series of numpy arrays instead of Python lists, so it should be faster than a.tolist(), which returns a list of Python lists.
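
To make the difference between the two variants concrete, here is a small sketch that checks what each Series actually holds. The names s_lists and s_arrays are just for illustration.

```python
import numpy as np
import pandas as pd

a = np.zeros((5, 2))

# a.tolist() converts every element to a Python float inside nested lists.
s_lists = pd.Series(a.tolist())

# list(a) iterates over the first axis and yields one 1-D array per row.
s_arrays = pd.Series(list(a))

print(type(s_lists.iloc[0]))
print(type(s_arrays.iloc[0]))
```

Note that iterating over a 2D array yields views of its rows, not copies. If the rows in the Series need to be independent of a, something like pd.Series([row.copy() for row in a]) should work, at the cost of the extra copies.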

edited Aug 09 '16 at 03:36

answered Aug 09 '16 at 01:05

Thanks, that one is faster for under about 25 columns, but much slower if there are hundreds or thousands. – zemekeneng Aug 09 '16 at 01:21
I found another approach that is faster. See the edited answer. – bpachev Aug 09 '16 at 03:33
Nice, thanks for sticking with it! It is the fastest in every scenario. – zemekeneng Aug 09 '16 at 03:55

score 3 · Answer 2 · answered Aug 09 '16 at 05:27

pd.Series(list(a))

is consistently slower than

pd.Series(a.tolist())

Tested with 500,000 to 20,000,000 rows.

a = np.ones((500000,2))

showing only 1,000,000 rows:

%timeit pd.Series(list(a))
1 loop, best of 3: 301 ms per loop

%timeit pd.Series(a.tolist())
1 loop, best of 3: 261 ms per loop

Merlin

That is true when you have only two columns. Try a couple thousand and see what happens. – bpachev Aug 09 '16 at 13:19
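
Since the result seems to depend on the number of columns, a rough way to check it on your own machine is to time both variants at several widths. This is only a sketch: the helper compare, the row count, and the column counts are arbitrary choices, and the timings will vary with hardware and with the numpy and pandas versions.

```python
import timeit

import numpy as np
import pandas as pd


def compare(n_rows, n_cols, repeats=3):
    """Return the best wall-clock time in seconds for each variant."""
    a = np.ones((n_rows, n_cols))
    t_list = min(timeit.repeat(lambda: pd.Series(list(a)),
                               number=1, repeat=repeats))
    t_tolist = min(timeit.repeat(lambda: pd.Series(a.tolist()),
                                 number=1, repeat=repeats))
    return t_list, t_tolist


# Arbitrary widths: a narrow, a medium, and a wide array.
for n_cols in (2, 25, 2000):
    t_list, t_tolist = compare(1000, n_cols)
    print(f"{n_cols:>5} cols: list(a) {t_list:.4f} s, a.tolist() {t_tolist:.4f} s")
```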
