
The following code is part of a larger project. In my project I have to read a large text file, probably many millions of lines, with each line containing a pair of decimal numbers separated by a space.

An example is the following:

-0.200000 -1.000000
-0.469967 0.249733
-0.475169 -0.314739
-0.086706 -0.901599

Until now I used a custom parser I wrote myself, which worked fine but was not the fastest. Searching online I found numpy's loadtxt and pandas' read_csv. The first worked great, but its speed was even worse than mine. The second was pretty fast, but I was getting errors later in my project (I solve some PDEs with the finite element method; while reading the coordinates with either my parser or loadtxt I get the correct result, but when I use read_csv the matrix A of the system Ax=b becomes singular).

So I created this test code to see what's going on:

import numpy as np
import pandas as pd

points_file = './points.txt'

points1 = pd.read_csv(points_file, header=None, sep='\s+', dtype=np.float64).values
points2 = np.loadtxt(points_file, dtype=np.float64)

if (np.array_equal(points1, points2)):
    print ('Equal')
else:
    print ('Not Equal')

for i in range(len(points1)):
    print (points1[i] == points2[i])

Surprisingly, the output was:

Not Equal
[ True  True]
[ True False]
[False  True]
[False False]

Already quite confused, I kept searching and found this function by user "Dan Lecocq" for getting the binary representation of a number.

So for the 2nd number in the 2nd line (0.249733), the binary representations from read_csv and loadtxt were, respectively:

0011111111001111111101110100000000111101110111011011000100100000
0011111111001111111101110100000000111101110111011011000100100001

and the decimal values:

2.49732999999999982776444085175E-1
2.49733000000000010532019700804E-1

Why is this happening? I mean, I read the same string from a text file and save it in memory as the same data type. I would also love to understand why this small difference affects my solution so much, but that would involve showing you around 1000 lines of my messy code; I first need to write more test code to find exactly where the problem is.
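To double-check the one-ulp difference without the linked function, here is a small self-contained sketch; `double_bits` is my own stand-in helper (not from the original code), and `np.nextafter` is only used to construct the adjacent double:

```python
import struct

import numpy as np

def double_bits(x):
    # Reinterpret the 8 bytes of an IEEE-754 double as a 64-bit unsigned
    # integer and format it as a 64-character binary string.
    return format(struct.unpack('>Q', struct.pack('>d', x))[0], '064b')

a = float('0.249733')      # the correctly-rounded nearest double
b = np.nextafter(a, 0.0)   # the adjacent double, one unit in the last place below

print(double_bits(a))      # ends in ...1 (what loadtxt produced)
print(double_bits(b))      # ends in ...0 (what read_csv produced)
print(a - b)               # 2**-55: the spacing of doubles in [0.125, 0.25)
```

The two bit patterns agree in every position except the last bit, which is exactly the one-ulp difference shown above.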

Software versions:

Ubuntu 16.04 64bit
Python: 2.7.12
Numpy: 1.11.0
Pandas: 0.18.0
Tom_K

    Pandas has its own decimal-float parsing functions for the purpose of speed. They sometimes do not give the most accurate floating point representations of the decimal inputs. – Robert Kern Jul 18 '16 at 21:18

    We are always telling new programmers - don't worry about those extra digits off at the end. Floating point representation of `0.249733` is inherently imprecise. The difference between those 2 numbers is `2**-55`. `np.allclose` returns `True`. – hpaulj Jul 18 '16 at 21:57

    Seems like a fair question from someone who wants to understand: "Why is this happening?" – adr Aug 01 '21 at 17:17

    It's worth noting that this no longer happens in Python 3 and current versions of numpy and pandas – Angus L'Herrou Feb 10 '22 at 23:04

1 Answer


I would ask myself the following question: how much precision does my project actually need?

If you can afford to lose some digits, I would suggest using pandas' or numpy's round().

Keep in mind that floating point processing is always finicky. Useful resources: correcting for floating point arithmetic 'errors' when rounding in pandas, or, if you know nothing about float representation: https://docs.python.org/3/tutorial/floatingpoint.html
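For example, a minimal sketch of the difference between bit-exact and tolerant comparison (here `np.nextafter` is only used to construct a value one ulp away, mimicking the read_csv result):

```python
import numpy as np

a = np.float64(0.249733)
b = np.nextafter(a, 0.0)   # the adjacent double, one ulp below a

print(np.array_equal(a, b))         # False: bit-exact comparison, as in the question
print(np.allclose(a, b))            # True: comparison within a tolerance
print(round(a, 6) == round(b, 6))   # True: both round to the same 6-decimal value
```

As far as I know, recent pandas versions also accept float_precision='round_trip' in read_csv, which trades some of the fast parser's speed for correctly-rounded parsing.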

frex