Pandas 'describe' is not returning summary of all columns

Question

I am running 'describe()' on a DataFrame and getting summaries of only the int columns (pandas 0.14.0).

The documentation says that for object columns the frequency of the most common value and additional statistics would be returned. What could be wrong? (No error message is returned, by the way.)

Edit:

I think this is how the function is designed to behave on DataFrames with mixed column types, although the documentation fails to mention it.

Example code:

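import pandas as pd

df_test = pd.DataFrame({'$a':[1,2], '$b': [10,20]})
df_test.dtypes                 # both columns start out as integers
df_test.describe()             # summarizes both columns
df_test['$a'] = df_test['$a'].astype(str)
df_test.describe()             # now only '$b' is summarized
df_test['$a'].describe()       # count, unique, top, freq
df_test['$b'].describe()       # count, mean, std, min, quartiles, max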

My ugly workaround in the meantime:

def my_df_describe(df):
    objects = []
    numerics = []
    for c in df:
        if (df[c].dtype == object):
            objects.append(c)
        else:
            numerics.append(c)

    return df[numerics].describe(), df[objects].describe()
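
The same split can be sketched more compactly with DataFrame.select_dtypes, assuming a pandas version that provides it. The function name describe_by_kind is hypothetical, and "non-numeric" here means everything that is not a number, which is slightly broader than the object check above:

```python
import pandas as pd

def describe_by_kind(df):
    """Return (numeric summary, non-numeric summary) for a mixed-dtype DataFrame."""
    numeric = df.select_dtypes(include="number")
    other = df.select_dtypes(exclude="number")
    # describe() on a frame with no numeric columns falls back to
    # count/unique/top/freq for the columns that are there.
    return numeric.describe(), other.describe()

df_test = pd.DataFrame({"$a": ["1", "2"], "$b": [10, 20]})
num_summary, obj_summary = describe_by_kind(df_test)
print(num_summary)
print(obj_summary)
```

Like the loop version, this raises an error if one of the two groups has no columns at all, so it is only meant for frames that really are mixed.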

It is not meaningful to calculate things like mean, std, etc. for object dtypes such as string and datetime; this is probably what you are seeing. You should see summary info for int and float columns — EdChum, Jul 02 '14 at 07:03
Actually the problem I am noticing is only for mixed (int/object) dataframes... — user2808117, Jul 02 '14 at 07:24
No, different columns have different dtypes. I have added an example and it can be seen how describe after the type change differs from the one before it. I would have expected to have both stats (object and int) in the same dataframe, with null values in parts where the statistics cannot be computed for the column (e.g., std for object types) — user2808117, Jul 02 '14 at 07:44
I ran your code and don't understand what the problem is. Like I said, there is no point in displaying stats for objects like strings. I understand that calling describe on just the string column shows count and unique when it didn't before. You could raise this as a feature request, but I imagine it would look ugly and unwieldy if you had a dataframe with lots of varying types and tried to format the output to take all the different dtypes into consideration — EdChum, Jul 02 '14 at 07:46
It's possible this is a bug. You'd have to either look at the source code or wait for one of the devs to look at this question and comment; I'm not sure whether this behaviour changed between 0.11.0 and 0.14.0 — EdChum, Jul 02 '14 at 07:49
OK, thank you. I just wrote an ugly workaround to suit my needs... — user2808117, Jul 02 '14 at 08:03

score 103 · Accepted Answer · edited Aug 01 '17 at 21:55

As of pandas v0.15.0, use DataFrame.describe(include = 'all') to get a summary of all the columns when the dataframe has mixed column types. The default behavior is to summarize only the numerical columns.

Example:

In[1]:

import numpy as np
import pandas as pd
df = pd.DataFrame({'$a':['a', 'b', 'c', 'd', 'a'], '$b': np.arange(5)})
df.describe(include = 'all')

Out[1]:

        $a    $b
count   5   5.000000
unique  4   NaN
top     a   NaN
freq    2   NaN
mean    NaN 2.000000
std     NaN 1.581139
min     NaN 0.000000
25%     NaN 1.000000
50%     NaN 2.000000
75%     NaN 3.000000
max     NaN 4.000000

The numerical columns will have NaNs for summary statistics pertaining to objects (strings) and vice versa.
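
As a small sketch of that NaN pattern (same example frame as above, nothing beyond standard pandas indexing assumed), the combined summary can be checked and then trimmed back to one column's own statistics with dropna():

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"$a": ["a", "b", "c", "d", "a"], "$b": np.arange(5)})
summary = df.describe(include="all")

# Rows that only make sense for strings are NaN in the numeric column...
print(summary.loc[["unique", "top", "freq"], "$b"].isna().all())
# ...and the numeric rows are NaN in the string column.
print(summary.loc[["mean", "std", "min", "max"], "$a"].isna().all())

# Dropping the NaNs leaves only the statistics that apply to that column.
print(summary["$b"].dropna())
```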

Summarizing only numerical or object columns

To call describe() on just the numerical columns, use describe(include = [np.number]).

To call describe() on just the objects (strings), use describe(include = ['O']).

In[2]:

df.describe(include = [np.number])

Out[2]:

         $b
count   5.000000
mean    2.000000
std     1.581139
min     0.000000
25%     1.000000
50%     2.000000
75%     3.000000
max     4.000000

In[3]:

df.describe(include = ['O'])

Out[3]:

        $a
count   5
unique  4
top     a
freq    2
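
describe() also takes an exclude parameter, which works the other way round. A minimal sketch on the same frame, selecting the non-numeric columns by excluding the numeric ones instead of naming the object dtype:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"$a": ["a", "b", "c", "d", "a"], "$b": np.arange(5)})

# exclude drops the listed dtypes and describes whatever columns are left.
non_numeric = df.describe(exclude=[np.number])
print(non_numeric)
```

This can be handy when the non-numeric columns are not all plain strings, since nothing has to be listed explicitly.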

All columns are *still* not displayed. — WestCoastProjects, Jun 27 '17 at 15:35

score 18 · Answer 2 · answered Jul 31 '18 at 06:51

Setting pd.options.display.max_columns = DATA.shape[1] will work.

Here DATA is a 2-D matrix (the DataFrame), so DATA.shape[1] is its number of columns, and the code above will display the stats vertically.

MoeChen

719
6
5

score 16 · Answer 3 · answered Jul 01 '19 at 22:28

In addition to the data type issues discussed in the other answers, you might also have too many columns to display. If there are too many columns, the middle columns will be replaced with an ellipsis (...).

Other answers have pointed out that the include='all' parameter of describe can help with the data type issue. Another question asked, "How do I expand the output display to see more columns?" The solution is to modify the display.max_columns setting, which can even be done temporarily. For example, to display up to 40 columns of output from a single describe statement:

with pd.option_context('display.max_columns', 40):
    print(df.describe(include='all'))
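
To see the effect in isolation, here is a sketch with a hypothetical wide frame (30 columns named c0 to c29, invented for illustration). It compares the text produced under a low column limit with the text produced when the limit is removed by passing None, and also transposes the summary so that each described column becomes a row:

```python
import numpy as np
import pandas as pd

# Hypothetical wide frame: 3 rows, 30 numeric columns named c0..c29.
wide = pd.DataFrame(np.arange(90).reshape(3, 30),
                    columns=[f"c{i}" for i in range(30)])
summary = wide.describe()

# With a low limit, the middle columns are elided from the printed text.
with pd.option_context("display.max_columns", 5):
    truncated_text = repr(summary)

# None removes the column limit for the duration of the block.
with pd.option_context("display.max_columns", None):
    full_text = repr(summary)

# One row per described column, which is often easier to scan.
transposed = summary.T

print(truncated_text)
print(transposed.head())
```

Because option_context restores the previous setting on exit, the rest of the session keeps its normal display limits.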

RJT · Answer 4 · 2014-07-02T14:26:24.910

By default, 'describe()' on a DataFrame only summarizes numeric columns. If you think you have a numeric variable and it doesn't show up in 'describe()', change the type with:

df[['col1', 'col2']] = df[['col1', 'col2']].astype(float)

You could also create new columns for handling the numeric part of a mixed-type column, or convert strings to numbers using a dictionary and the map() function.
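
A rough sketch of both ideas follows. The column names, the size_codes dictionary and the sample values are all made up for illustration, and pd.to_numeric is used here as one possible way to pull out the numeric part; it is not something this answer prescribes:

```python
import pandas as pd

df = pd.DataFrame({
    "label": ["small", "large", "small"],
    "price": ["1.5", "2.0", "n/a"],
})

# Hypothetical mapping from labels to numbers; labels missing from the
# dictionary would become NaN.
size_codes = {"small": 0, "large": 1}
df["label_code"] = df["label"].map(size_codes)

# errors="coerce" turns strings that cannot be parsed, such as "n/a", into NaN.
df["price_num"] = pd.to_numeric(df["price"], errors="coerce")

# The two new numeric columns now show up in the default summary.
summary = df.describe()
print(summary)
```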

'describe()' on a non-numerical Series will give you some statistics (like count, unique and the most frequently occurring value).

score 3 · Answer 5 · answered Feb 02 '17 at 10:26

In addition to DataFrame.describe(include = 'all') one can also use Series.value_counts() for each categorical column:

In[1]:

df = pd.DataFrame({'$a':['a', 'b', 'c', 'd', 'a'], '$b': np.arange(5)})
df['$a'].value_counts()

Out[1]:
$a
a    2
d    1
b    1
c    1
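
As a further sketch along the same lines, value_counts(normalize=True) returns relative rather than absolute frequencies, and a short loop can build one frequency table per non-numeric column. Selecting those columns with select_dtypes(exclude="number") is an assumption made for this example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"$a": ["a", "b", "c", "d", "a"], "$b": np.arange(5)})

# Absolute frequencies, most common value first.
counts = df["$a"].value_counts()
# Relative frequencies that sum to 1.
shares = df["$a"].value_counts(normalize=True)

# One frequency table per non-numeric column, keyed by column name.
tables = {col: df[col].value_counts()
          for col in df.select_dtypes(exclude="number")}

print(counts)
print(shares)
```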

score 2 · Answer 6 · answered Mar 16 '18 at 11:36

You can execute df_test.info() to get the list of data types your data frame contains. If your data frame contains only numerical columns, then df_test.describe() will work perfectly fine, as by default it provides a summary of the numerical values. If you want a summary of your object (string) features, you can use df_test.describe(include=['O']).

Or in short, you can just use df_test.describe(include='all') to get a summary of all the feature columns when your data frame has columns of various data types.
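
To check which columns the default call leaves out, one possible sketch (using the question's df_test after the string conversion) compares the columns of the two summaries:

```python
import pandas as pd

df_test = pd.DataFrame({"$a": ["1", "2"], "$b": [10, 20]})

# dtypes maps each column name to its dtype; info() prints a similar report.
print(df_test.dtypes)

default_cols = list(df_test.describe().columns)            # numeric columns only
all_cols = list(df_test.describe(include="all").columns)   # every column
skipped = [c for c in all_cols if c not in default_cols]
print(skipped)
```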
