I'm a Python newbie and I've noticed something strange in such a basic function as print().

Let the code explain. I would like to save in a list all the outliers of an observation. So I've written the following snippet:

import numpy as np

def compute_outliers(obs):
    outliers=[]

    q1 = np.percentile(obs, 25)
    q3 = np.percentile(obs, 75)
    iqr = q3 - q1
    print('q1: ', q1)
    print('q3: ', q3)
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr

    for i in obs:
        if i < lower_limit or i > upper_limit:
            outliers.append(i)
    return outliers

outliers = compute_outliers(data)

Here data is a generic feature (in the sense of "column") of a DataFrame object from the pandas library.

Now, if I type

for i in outliers:
    print(i)

The output is OK:

20.0
0.0
17.6
2.7
18.9
0.0
18.0

While, if I type:

print(outliers)

This is the output:

[20.0, 0.0, 17.600000000000001, 2.7000000000000002, 18.899999999999999, 0.0, 18.0]

You can see that some values (the third, the fourth, the fifth) are 'dirty'. I could simply use the first snippet for printing, but I'm curious about how all of this works, so I would like to know WHY this happens.

EDIT

I think that, to complete the question, it would be useful to know how to 'fix' this issue, i.e. how to print the list with the clean values. Could you help?

Bernheart

2 Answers

This effect results from a combination of these facts:

das-g

Yeah, this is a well-known floating-point issue combined with some trickery involving repr and str in Python.

If you use Python 2, you can try this:

print(0.1 + 0.2)
# 0.3
print([0.1 + 0.2])
# [0.30000000000000004]

This is because 0.1 + 0.2 is in fact not equal to 0.3 in IEEE 754 floating-point arithmetic. The reason is that the float 0.1 is not exactly 1/10, since 1/10 cannot be written as a finite binary fraction at all.
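To see this for yourself, here is a small sketch using only the standard library. The variable names are just illustrative; Decimal and Fraction are used because they can show the exact value stored in a float.

```python
from decimal import Decimal
from fractions import Fraction
import math

total = 0.1 + 0.2

# Direct comparison fails because both sides carry tiny binary rounding errors.
is_equal = (total == 0.3)

# Decimal(float) converts the stored binary value exactly, with no rounding.
exact_tenth = Decimal(0.1)

# The stored value of 0.1 is a fraction with a power-of-two denominator,
# so it cannot be exactly 1/10.
tenth_is_exact = (Fraction(0.1) == Fraction(1, 10))

# For comparisons, a tolerance-based check is usually what you want.
close_enough = math.isclose(total, 0.3)

print(is_equal)        # False
print(exact_tenth)     # starts with 0.1000000000000000055511151...
print(tenth_is_exact)  # False
print(close_enough)    # True
```

So the "extra" digits are not added by print; they are part of the value that is actually stored.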

When you call print on a number, it uses str() for that number. str() is a representation that aims at readability, so it can omit some "insignificant" digits to make the number more readable.

On the other hand, when you print a list, the algorithm that stringifies the list uses repr() for every item. repr() aims at exactness and reproducibility, so it provides all the digits that are needed to reconstruct the number. This does not mean that it uses all the digits (e.g. repr(0.1) is still "0.1", not "0.1000000000000000055511151", which can be obtained with print("%.25f" % 0.1)), but it can use more digits than str() does.
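The str/repr split is easy to check with a toy class. This is only an illustration: Reading is a made-up class, not something from numpy or pandas. A custom class is used because it behaves the same on every Python version, whereas built-in floats on recent Python 3 versions give the same text for str() and repr().

```python
class Reading:
    """Hypothetical wrapper, used only to make str() and repr() distinguishable."""

    def __init__(self, value):
        self.value = value

    def __str__(self):
        return "str form"

    def __repr__(self):
        return "repr form"


r = Reading(17.6)

single = str(r)      # what print(r) would show
in_list = str([r])   # what print([r]) would show

print(single)   # str form
print(in_list)  # [repr form]
```

The list never calls __str__ on its items, which is why the same number can look different inside and outside a list.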

EDIT. If you want more user-friendly output when printing a list, you can build it manually with something like:

print(", ".join("{:.2f}".format(x) for x in outliers))
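If you need this in several places, the same idea can be wrapped in a small helper. This is just a sketch: format_floats and its digits parameter are names I made up, and the sample values are taken from the question.

```python
def format_floats(values, digits=2):
    """Return a list-like string with every value formatted to a fixed number of decimals."""
    # "{:.{}f}" takes the value first, then the number of decimals.
    parts = ["{:.{}f}".format(v, digits) for v in values]
    return "[" + ", ".join(parts) + "]"


sample = [20.0, 0.0, 17.600000000000001, 2.7000000000000002]

print(format_floats(sample, 1))  # [20.0, 0.0, 17.6, 2.7]
print(format_floats(sample))     # [20.00, 0.00, 17.60, 2.70]
```

Note that this only changes how the numbers are displayed; the values stored in the list stay exactly the same.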

See also this thread for different approaches and this site for more formatting options.

Ilya V. Schurov