
I'm trying to find the average of all the values in one of the columns of my dataset. I used df["column"].mean(), but it's giving me a ridiculously big number that doesn't make sense considering how small my values are. The min() and max() functions work fine, however.

Here is what I'm talking about.

To clarify, the left side of the output in the first cell shows the indexes, and the right side shows the values.

delay["If Delayed, for how long?"].astype(int)

print(delay["If Delayed, for how long?"].min())
print(delay["If Delayed, for how long?"].max())
print(delay["If Delayed, for how long?"].mean()
  • Please include self contained code that enables us to duplicate the issue. Pictures of parts of your code are useless for reproducing your problem. – timgeb Sep 12 '18 at 17:36
  • This seems like a bug. I am reopening. Can you post the code instead of a picture? – ayhan Sep 12 '18 at 17:36

1 Answer


Probably pandas should refuse to take the mean of a string column. But it doesn't, so what you get is:

In [154]: s = pd.Series([15,18,16,14,20,16,15]).astype(str)

In [155]: s.sum()
Out[155]: '15181614201615'

In [156]: float(s.sum()) / len(s)
Out[156]: 2168802028802.1428

In [157]: s.mean()
Out[157]: 2168802028802.1428

s.min() and s.max() will "work", but they give the lexicographic minimum and maximum, not the numerical ones, so '111' < '20'.
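You can see the same character-by-character comparison with plain Python strings:

In [158]: '111' < '20'
Out[158]: True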

Make your column numerical, whether int or float, whichever you prefer, and remember that .astype doesn't work in-place, so you'll need

delay["If Delayed, for how long?"] = delay["If Delayed, for how long?"].astype(int)

if you want the column to actually change.
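For instance, with the example series from above, converting to int first gives a sensible mean instead of the concatenated-string result:

In [159]: s.astype(int).mean()
Out[159]: 16.285714285714285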

DSM