I'm confronted with a nauseating issue dealing with the to_csv() function for DataFrame in pandas 0.14.0. I have a list of long numpy arrays as one column in the DataFrame df:

>>> df['col'][0]    
array([   0,    1,    2, ..., 9993, 9994, 9995])
>>> len(df['col'][0])
46889
>>> type(df['col'][0][0])
<class 'numpy.int64'>

If I save df by

df.to_csv('df.csv')

and open df.csv in LibreOffice, the corresponding column shows up like this:

[ 0,    1,    2, ..., 9993, 9994, 9995]

rather than listing all 46889 numbers. Is there a way to force to_csv to write out every number instead of an ellipsis?

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 4 columns):
pair          2 non-null object
ARXscore      2 non-null float64
bselect       2 non-null bool
col           2 non-null object
dtypes: bool(1), float64(1), object(2)
zhh210
  • What does the output of `df.info()` look like? The pasted output with spacing in the array entries like that seems odd. – cwharland Aug 19 '14 at 03:41
  • Comments are not well formatted, so I revised the question to include df.info(). – zhh210 Aug 19 '14 at 04:01
  • This is kind of an odd way to store the data; why is the numpy array being used as an object? – U2EF1 Aug 19 '14 at 04:09
  • You're storing the array as a string, so the output you are seeing is expected. If you want to output an array, you need the actual array, not a truncated string of it. – cwharland Aug 20 '14 at 13:28

2 Answers

In some sense this is a duplicate of printing the entire numpy array, since to_csv simply asks each item in your DataFrame for its __str__, so you need to see how that prints:

In [11]: np.arange(10000)
Out[11]: array([   0,    1,    2, ..., 9997, 9998, 9999])

In [12]: np.arange(10000).__str__()
Out[12]: '[   0    1    2 ..., 9997 9998 9999]'

As you can see, once the array is over a certain size threshold it prints with an ellipsis. To print everything, set the threshold to NaN:

np.set_printoptions(threshold='nan')

To give an example:

In [21]: df = pd.DataFrame([[np.arange(10000)]])

In [22]: df  # Note: pandas printing is different!!
Out[22]:
                                                   0
0  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...

In [23]: s = StringIO()

In [24]: df.to_csv(s)

In [25]: s.getvalue()  # ellipsis
Out[25]: ',0\n0,"[   0    1    2 ..., 9997 9998 9999]"\n'

Once the threshold is changed, to_csv records the entire array:

In [26]: np.set_printoptions(threshold='nan')

In [27]: s = StringIO()

In [28]: df.to_csv(s)

In [29]: s.getvalue()  # no ellipsis (it's all there)
Out[29]: ',0\n0,"[   0    1    2    3    4    5    6    7    8    9   10   11   12   13   14\n   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29\n   30   31   32   33   34   35   36   37   38   39   40   41   42   43   44\n   45   46   47   48   49   50   51   52   53   54   55   56   57   58   59\n   60   61  # the whole thing is here...

As mentioned in the comments, numpy arrays in object columns are not usually a good choice of structure for a DataFrame, as you lose much of the pandas speed/efficiency/magic sauce.
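If the arrays have to stay in an object column, one alternative to changing NumPy's print options is to serialize each array yourself before calling to_csv. The sketch below is not part of the approach above; it assumes JSON text is an acceptable cell format, and the small frame is made up for illustration.

```python
import io
import json

import numpy as np
import pandas as pd

# Hypothetical frame: one object column holding long arrays.
df = pd.DataFrame({"col": [np.arange(5000), np.arange(3000)]})

# Encode each array as a JSON list so the CSV cell holds every element,
# independent of NumPy's print threshold.
out = df.assign(col=df["col"].map(lambda a: json.dumps(a.tolist())))
buf = io.StringIO()
out.to_csv(buf, index=False)

# Reading back: decode the JSON text into arrays again.
buf.seek(0)
restored = pd.read_csv(buf)
restored["col"] = restored["col"].map(lambda s: np.array(json.loads(s)))
```

Because the cell content is produced explicitly, the written file does not depend on any global print setting, and the column can be rebuilt on load.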

Andy Hayden
The call

np.set_printoptions(threshold='nan')

does not work with the latest version of NumPy. Use:

import sys
import numpy
numpy.set_printoptions(threshold=sys.maxsize)
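
Since set_printoptions changes a global setting, it may be preferable to lift the limit only while the file is written. A minimal sketch, assuming a NumPy version that provides the np.printoptions context manager and a pandas version whose to_csv still writes each array via its string form; the toy frame is hypothetical:

```python
import io
import sys

import numpy as np
import pandas as pd

# Hypothetical frame: one object column holding long arrays.
df = pd.DataFrame({"col": [np.arange(5000), np.arange(3000)]})

# With the default print options, long arrays are summarized.
buf = io.StringIO()
df.to_csv(buf)
truncated = buf.getvalue()

# Lift the limit only inside the block; the previous options
# are restored when the block exits.
with np.printoptions(threshold=sys.maxsize):
    buf = io.StringIO()
    df.to_csv(buf)
    full = buf.getvalue()

print("..." in truncated, "..." in full)
```

The same pattern applies when writing to a file path instead of a StringIO buffer.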
Florida Man