The following code is part of a bigger project. In that project I have to read a large text file, probably with many millions of lines, where each line holds a pair of decimal numbers separated by a space.
An example is the following:
-0.200000 -1.000000
-0.469967 0.249733
-0.475169 -0.314739
-0.086706 -0.901599
Until now I used a custom parser that I wrote myself; it worked fine but was not the fastest. Searching online I found numpy's loadtxt and pandas' read_csv. The first worked correctly, but its speed was even worse than my parser's. The second was pretty fast, but I was getting errors later in my project: I solve some PDEs with the finite element method, and when I read the coordinates with either my parser or loadtxt I get the correct result, but when I use read_csv the matrix A of the system Ax=b becomes singular.
So I created this test code to see what's going on:
import numpy as np
import pandas as pd
points_file = './points.txt'
points1 = pd.read_csv(points_file, header=None, sep=r'\s+', dtype=np.float64).values
points2 = np.loadtxt(points_file, dtype=np.float64)
if np.array_equal(points1, points2):
    print('Equal')
else:
    print('Not Equal')

# Compare the two arrays row by row to see which entries differ
for i in range(len(points1)):
    print(points1[i] == points2[i])
Surprisingly, the output was:
Not Equal
[ True True]
[ True False]
[False True]
[False False]
Already quite confused, I continued searching and I found this function from user "Dan Lecocq" to get the binary representation of the numbers.
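The linked function is not reproduced here. A minimal sketch of how such a helper could look, using only the standard `struct` module, is below; the name `float_to_bits` is my own and not necessarily what the linked answer uses.

```python
import struct


def float_to_bits(x):
    """Return the 64-bit IEEE 754 pattern of a float as a string of '0'/'1'."""
    # Pack the value as a big-endian double, then reinterpret the same
    # 8 bytes as an unsigned 64-bit integer.
    (as_int,) = struct.unpack('>Q', struct.pack('>d', x))
    # Zero-pad to 64 binary digits: 1 sign bit, 11 exponent bits, 52 mantissa bits.
    return format(as_int, '064b')


print(float_to_bits(1.0))
print(float_to_bits(-2.0))
```

Since `np.float64` is a subclass of Python's `float`, an element of either array should be usable as the argument directly, for example `float_to_bits(points1[1][1])`.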
So for the 2nd number in the 2nd line (0.249733) the binary representation from read_csv and loadtxt was respectively:
0011111111001111111101110100000000111101110111011011000100100000
0011111111001111111101110100000000111101110111011011000100100001
and the decimal values:
2.49732999999999982776444085175E-1
2.49733000000000010532019700804E-1
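For anyone who wants to reproduce these numbers without my points file, the sketch below rebuilds both doubles from the bit patterns above, prints their exact decimal expansions with `decimal.Decimal`, and counts how many representable doubles lie between them. The helper names `bits_to_float` and `ulp_distance` are my own, and `ulp_distance` assumes two finite values of the same sign.

```python
import struct
from decimal import Decimal


def bits_to_float(bits):
    """Build a float64 from a 64-character string of '0'/'1' (big-endian)."""
    return struct.unpack('>d', struct.pack('>Q', int(bits, 2)))[0]


def ulp_distance(a, b):
    """Count the float64 steps between two finite values of the same sign."""
    # Reinterpret each double as a signed 64-bit integer; for same-sign
    # values, adjacent doubles map to adjacent integers.
    ia = struct.unpack('>q', struct.pack('>d', a))[0]
    ib = struct.unpack('>q', struct.pack('>d', b))[0]
    return abs(ia - ib)


# Bit patterns copied from the output above
from_read_csv = bits_to_float(
    '0011111111001111111101110100000000111101110111011011000100100000')
from_loadtxt = bits_to_float(
    '0011111111001111111101110100000000111101110111011011000100100001')

# Decimal(float) converts exactly, so these are the full stored values
print(Decimal(from_read_csv))
print(Decimal(from_loadtxt))

# How many representable doubles apart are the two results?
print(ulp_distance(from_read_csv, from_loadtxt))

# Does Python's own float() agree with the loadtxt value?
print(float('0.249733') == from_loadtxt)
```

The two patterns differ only in the last bit, so the values are adjacent doubles: the distance is one unit in the last place, which is 2**-55 at this magnitude.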
Why is this happening? I mean, I read the same string from a text file and store it in memory as the same data type. I would also love to understand why this small difference affects my solution so much, but that would involve showing you around 1000 lines of my messy code. I first need to write more test code to find exactly where the problem is.
Software versions:
Ubuntu 16.04 64bit
Python: 2.7.12
Numpy: 1.11.0
Pandas: 0.18.0