
I use the following code to create a NumPy ndarray from a CSV file. The file has 9 columns, and I explicitly type each column:

dataset = np.genfromtxt("data.csv", delimiter=",", dtype=('|S1', float, float, float, float, float, float, float, int))

Now I would like to get some descriptive statistics for each column (min, max, stdev, mean, median, etc.). Shouldn't there be an easy way to do this?

I tried this:

from scipy import stats
stats.describe(dataset)

but this returns an error: TypeError: cannot perform reduce with flexible type

How can I get descriptive statistics of the created NumPy array?

nbro
beta
    I think the error is because there are multiple `dtype`'s in your array. Especially a string would be problematic to describe statistically. Perhaps you could just loop over each of your columns, and describe the columns separately? – M.T Jul 26 '16 at 07:39
  • Thanks for the answer. How can I just access, for instance, the second column of the array? I tried `stats.describe(dataset[2])` but it yields the same error as in my OP. – beta Jul 26 '16 at 07:40
  • I suspect there is maybe something wrong with my array? How should a proper numpy array based on a CSV file look? Mine looks like this, if I print it: http://pastebin.com/MYyqbSG0 – beta Jul 26 '16 at 07:44
  • Do you get the same error if you do `stats.describe(dataset[2].astype(float))`? – M.T Jul 26 '16 at 07:44
  • 2
    @beta If you are dealing with non-uniform data (looks like you are), you should have a look at `pandas` which is much more powerful for such kind of thing. – Holt Jul 26 '16 at 07:45
  • @M.T: Then I get `ValueError: could not convert string to float: 'F'` – beta Jul 26 '16 at 07:45
  • Use `unpack=True` in `genfromtxt` first – M.T Jul 26 '16 at 07:46
  • @Holt: I want to use numpy because I will use it in combination with `scikit-learn` and many examples are based on numpy... I don't think I deal with non-uniform data. I just have one column which is a string, because it's categorical data. – beta Jul 26 '16 at 07:46
  • @M.T: Same with `unpack=True` – beta Jul 26 '16 at 07:47
  • @beta Pandas' column are numpy arrays, so you can easily mix pandas and `scikit-learn`. – Holt Jul 26 '16 at 07:47
  • Look at `dataset.dtype`. See the fields that you defined? The field names? That's how you access the columns. You created a structured array. Read about those - either the docs or SO questions. – hpaulj Jul 26 '16 at 12:09
  • 1
    If no field names are given, the default field names are `'f0'`, `'f1'`, etc. So instead of `stats.describe(dataset[2])`, use `stats.describe(dataset['f2'])`. – Warren Weckesser Jul 26 '16 at 14:14
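Putting the last two comments together, here is a minimal sketch of the default-field-name access they describe (the sample rows are made up, since the original `data.csv` isn't shown):

```python
import numpy as np
from scipy import stats

# Made-up rows mirroring the question's layout: a string column followed
# by numeric columns. Without explicit names, genfromtxt assigns the
# default field names 'f0', 'f1', ...
rows = [b"F,0.35,1", b"M,0.71,2", b"F,0.54,3"]
dataset = np.genfromtxt(rows, delimiter=",", dtype=('|S1', float, int))

print(dataset.dtype.names)            # ('f0', 'f1', 'f2')
print(stats.describe(dataset['f1']))  # a single numeric field, accessed by name
```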

5 Answers

import pandas as pd
import numpy as np

df_describe = pd.DataFrame(dataset)
df_describe.describe()

Please note that `dataset` is your np.array to describe.

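The same idea end to end, using a hypothetical structured array standing in for the one the question's `genfromtxt` call produces (pandas turns the field names into column labels):

```python
import numpy as np
import pandas as pd

# Hypothetical structured array like the question's dataset (values made up)
dataset = np.array([(b'F', 0.35, 1), (b'M', 0.71, 2), (b'F', 0.54, 3)],
                   dtype=[('f0', 'S1'), ('f1', '<f8'), ('f2', '<i4')])

df = pd.DataFrame(dataset)  # field names become column labels
print(df.describe())        # count/mean/std/min/quartiles/max per numeric column
```

Note that `describe()` skips the string column `f0` by default, which neatly sidesteps the `TypeError` from the question.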
INNO TECH
  • I think this is by far the easiest option. You don't even need to create a new variable, you just write `pd.DataFrame(my_array).describe()`. – kyriakosSt Aug 07 '20 at 15:58
  • 1
    For the case the OP asks for, I think the code of this answer should be `pd.read_csv("data.csv").describe()` rather than implying that the data is loaded on a numpy array in the first place – kyriakosSt Aug 07 '20 at 16:00
  • 2
    Just one line, no for loop nothing. This is the best answer. – agent18 Dec 28 '20 at 10:33

This is not a pretty solution, but it gets the job done. The problem is that by specifying multiple dtypes, you are essentially making a 1D array of tuples (actually np.void), which cannot be described by stats because it includes multiple different types, including strings.

This could be resolved by either reading it in two rounds, or using pandas with read_csv.

If you decide to stick to numpy:

import numpy as np
a = np.genfromtxt('sample.txt', delimiter=",",unpack=True,usecols=range(1,9))
s = np.genfromtxt('sample.txt', delimiter=",",unpack=True,usecols=0,dtype='|S1')

from scipy import stats
for arr in a:  # do not need the loop at this point, but it looks prettier
    print(stats.describe(arr))
#Output per print:
DescribeResult(nobs=6, minmax=(0.34999999999999998, 0.70999999999999996), mean=0.54500000000000004, variance=0.016599999999999997, skewness=-0.3049304880932534, kurtosis=-0.9943046886340534)

Note that in this example the final array has dtype float, not int, but it can easily (if necessary) be converted to int using `arr.astype(int)`.
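Since `sample.txt` isn't included, the same two-pass read can be exercised on in-memory rows (made up here to match the one-string-plus-numbers layout):

```python
import numpy as np
from scipy import stats

# Made-up rows standing in for 'sample.txt'
lines = [b"F,0.35,0.61", b"M,0.71,0.52", b"F,0.54,0.48"]

# Pass 1: numeric columns only; unpack=True yields one row per column
a = np.genfromtxt(lines, delimiter=",", unpack=True, usecols=range(1, 3))
# Pass 2: the string column on its own
s = np.genfromtxt(lines, delimiter=",", usecols=0, dtype='|S1')

for arr in a:
    print(stats.describe(arr))
```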

M.T
  • This use of `usecols` is good. I don't think you need `unpack`. – hpaulj Jul 26 '16 at 17:02
  • @hpaulj If one accesses the data the way you show in your answer (which I think deserves to be the accepted answer), then unpack is unnecessary. Still, in my experience, both with `genfromtxt` and `loadtxt` I find I always work with columns (ie. the transposed of the normal output) when dealing with scientific data from csv-like documents. It is also less easy to loop over the recarray fields. – M.T Jul 27 '16 at 07:04
  • I have opened a spin-off question for nested structures, see https://stackoverflow.com/questions/62385252/how-can-i-get-the-statistics-of-all-columns-including-those-with-a-nested-struct/62385253#62385253 – questionto42 Jun 15 '20 at 09:31

The question of how to deal with mixed data from genfromtxt comes up often. People expect a 2d array, and instead get a 1d that they can't index by column. That's because they get a structured array - with different dtype for each column.

All the examples in the genfromtxt doc show this:

>>> s = StringIO("1,1.3,abcde")
>>> data = np.genfromtxt(s, dtype=[('myint','i8'),('myfloat','f8'),
... ('mystring','S5')], delimiter=",")
>>> data
array((1, 1.3, 'abcde'),
      dtype=[('myint', '<i8'), ('myfloat', '<f8'), ('mystring', '|S5')])

But let me demonstrate how to access this kind of data:

In [361]: txt=b"""A, 1,2,3
     ...: B,4,5,6
     ...: """
In [362]: data=np.genfromtxt(txt.splitlines(),delimiter=',',dtype=('S1,int,float,int'))
In [363]: data
Out[363]: 
array([(b'A', 1, 2.0, 3), (b'B', 4, 5.0, 6)], 
      dtype=[('f0', 'S1'), ('f1', '<i4'), ('f2', '<f8'), ('f3', '<i4')])

So my array has 2 records (check the shape), which are displayed as tuples in a list.

You access fields by name, not by column number (do I need to add a structured array documentation link?)

In [364]: data['f0']
Out[364]: 
array([b'A', b'B'], 
      dtype='|S1')
In [365]: data['f1']
Out[365]: array([1, 4])

In a case like this it might be more useful to choose a dtype with 'subarrays'. This is a more advanced dtype topic:

In [367]: data=np.genfromtxt(txt.splitlines(),delimiter=',',dtype=('S1,(3)float'))
In [368]: data
Out[368]: 
array([(b'A', [1.0, 2.0, 3.0]), (b'B', [4.0, 5.0, 6.0])], 
      dtype=[('f0', 'S1'), ('f1', '<f8', (3,))])
In [369]: data['f1']
Out[369]: 
array([[ 1.,  2.,  3.],
       [ 4.,  5.,  6.]])

The character column is still loaded as S1, but the numbers are now in a 3 column array. Note that they are all float (or int).

In [371]: from scipy import stats
In [372]: stats.describe(data['f1'])
Out[372]: DescribeResult(nobs=2, 
   minmax=(array([ 1.,  2.,  3.]), array([ 4.,  5.,  6.])),
   mean=array([ 2.5,  3.5,  4.5]), 
   variance=array([ 4.5,  4.5,  4.5]), 
   skewness=array([ 0.,  0.,  0.]), 
   kurtosis=array([-2., -2., -2.]))
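If the file has already been loaded with per-column dtypes, as in the question, `numpy.lib.recfunctions.structured_to_unstructured` (available in NumPy >= 1.16, so an assumption beyond the answer above) can turn the numeric fields of the structured array into a plain 2D array, which `stats.describe` then handles column-wise just like the subarray case:

```python
import numpy as np
from numpy.lib import recfunctions as rfn
from scipy import stats

# The structured array from the example above
data = np.array([(b'A', 1, 2.0, 3), (b'B', 4, 5.0, 6)],
                dtype=[('f0', 'S1'), ('f1', '<i4'), ('f2', '<f8'), ('f3', '<i4')])

# Select the numeric fields, then flatten them into an ordinary 2D array
numeric = rfn.structured_to_unstructured(data[['f1', 'f2', 'f3']])
print(numeric)                  # [[1. 2. 3.] [4. 5. 6.]]
print(stats.describe(numeric))  # column-wise statistics
```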
hpaulj

Official Scipy Documentation Example

#INPUT
from scipy import stats
a = np.arange(10)
stats.describe(a)

#OUTPUT
DescribeResult(nobs=10, minmax=(0, 9), mean=4.5, variance=9.166666666666666,
               skewness=0.0, kurtosis=-1.2242424242424244)

#INPUT
b = [[1, 2], [3, 4]]
stats.describe(b)

#OUTPUT
DescribeResult(nobs=2, minmax=(array([1, 2]), array([3, 4])),
               mean=array([2., 3.]), variance=array([2., 2.]),
               skewness=array([0., 0.]), kurtosis=array([-2., -2.]))
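As the second output shows, `describe` works along axis 0 by default, giving per-column results. Passing `axis=None` (a standard `scipy.stats.describe` parameter) instead treats the input as one flat sample:

```python
from scipy import stats

b = [[1, 2], [3, 4]]
print(stats.describe(b, axis=None))  # nobs=4: the 2x2 input is flattened
```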
sogu

For those who need fast computations, scipy + numpy are faster than pandas:

# scipy + numpy
def get_stats(v):
    res = stats.describe(v)
    return np.concatenate([
        [
            res.minmax[0],
            res.minmax[1],
            res.mean,
            res.variance,
            res.skewness,
            res.kurtosis
        ],
        np.percentile(v, q=[10, 25, 50, 75, 90])
    ])
%timeit get_stats(np.arange(100))
639 µs ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%timeit pd.Series(np.arange(100)).describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9])
830 µs ± 31.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Also, pandas' `describe()` does not report kurtosis or skewness.


Warning: using pd.DataFrame(array).describe() is slower:

%timeit pd.DataFrame(np.arange(100)).describe(percentiles=[0.1, 0.25, 0.5, 0.75, 0.9])
1.43 ms ± 75.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
etiennedm