While trying to create a tuple column of latitude and longitude coordinates from two separate columns, I stumbled upon zip as a pretty fast alternative to itertuples, list comprehensions, etc. It needs to be fast because I am dealing with roughly 4M rows and I don't want to waste time on attribute creation.

The good thing is that my question practically asks itself when you look at the output of this code: what is happening, and how can it be prevented? I am absolutely positive that e.g. 52.353500 is as precise as it gets and that the DataFrame is not just cutting it off for display, because this already equals a (very rough) positional precision of 10 centimeters.

print(df['lat'].head())
print(df['long'].head())
list(zip(df['lat'].head(), df['long'].head()))

Output:

14    52.353500
37    52.355511
42    52.354019
44    52.373829
83    52.354599
Name: lat, dtype: float32

14    5.00611
37    4.90732
42    4.92045
44    4.84816
83    4.89405
Name: long, dtype: float32

[(52.35350036621094, 5.006110191345215),
 (52.35551071166992, 4.907320022583008),
 (52.35401916503906, 4.920450210571289),
 (52.37382888793945, 4.8481597900390625),
 (52.35459899902344, 4.894050121307373)]

As requested: the DataFrame was loaded using read_csv with dtype float32 for both columns.
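For anyone wondering where the extra digits come from, here is a minimal sketch using only the standard library. It assumes the CSV contained the literal 52.353500 (as the question states) and imitates float32 storage with struct instead of pandas; the helper name to_float32 is made up for illustration.

```python
import struct


def to_float32(value: float) -> float:
    """Round a Python float (64-bit) to the nearest float32, then widen it back."""
    return struct.unpack("f", struct.pack("f", value))[0]


lat = 52.3535  # assumed value from the CSV file
as_float32 = to_float32(lat)

# float32 cannot store 52.3535 exactly, so the nearest representable value is kept.
print(as_float32)           # 52.35350036621094, the value seen in the zip output
print(f"{as_float32:.6f}")  # 52.353500, what a six-decimal display shows
print(round(as_float32, 6) == lat)  # True: rounding recovers the intended value
```

The Series display rounds to six decimals, which hides the difference. zip hands back plain Python floats, which print every digit needed to identify the stored value.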

Solution: It was a mixture of three things: I didn't know the limitations of how a Series displays floats, I didn't use float_precision when reading the data in, and I combined float32 with float_precision. Kids, use the float dtype and let pandas decide (to use float64).
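A rough sketch of what that solution could look like. The two-row CSV text is made-up sample data standing in for the real file, and float_precision="round_trip" is one of the read_csv options mentioned above; I have not benchmarked this on 4M rows.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real CSV file.
csv_text = "lat,long\n52.353500,5.00611\n52.355511,4.90732\n"

# dtype=float lets pandas use float64; round_trip parses each number
# the same way Python's own float() would.
df = pd.read_csv(io.StringIO(csv_text), dtype=float, float_precision="round_trip")

pairs = list(zip(df["lat"], df["long"]))
print(df.dtypes)
print(pairs)
```

With float64 the stored values are the closest doubles to the decimal text, so the tuples print with the same digits as the file.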

dasjanik
  • You should show us how you loaded your DataFrame, so we can replicate it and help you make sure that there wasn't extra precision. – Kyle Jun 04 '19 at 13:45

1 Answer

This is perfectly well-defined behaviour: pandas truncates the trailing digits when it displays a Series, based on the preset display precision. The stored values are not changed:

import math
import pandas as pd

math.pi
# 3.141592653589793

pi has 15 digits of precision here. However, in a Series, it does not show as being so:

pd.Series([math.pi])

0    3.141593
dtype: float64

pd.Series([math.pi]).tolist()
# [3.141592653589793]

This is because,

pd.get_option('display.precision')
# 6

Read more about Options and Settings and how you can change them.
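As a small sketch of changing that option, the display precision can be raised temporarily with pd.option_context, which shows the digits the default view hides. The value 15 is just an illustrative choice.

```python
import math

import pandas as pd

s = pd.Series([math.pi])

# Default display: six decimal places.
default_view = repr(s)

# Temporarily raise the display precision; the stored data is untouched.
with pd.option_context("display.precision", 15):
    wide_view = repr(s)

print(default_view)
print(wide_view)
```

Using a context manager keeps the change local. pd.set_option("display.precision", 15) would apply it for the rest of the session.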

If you want to actually round your floats to a certain precision, use round:

pd.Series([math.pi]).round(decimals=6).tolist()
# [3.141593]
cs95
  • oh wow, didn't know about the Series part. I guess `the Dataframe is not just cutting it of for view` is wrong then. However, I would assume that the following decimal digits were just zeros and an expanded view would show exactly this? Because, as I said, I am sure about the input only having six digit precision. – dasjanik Jun 04 '19 at 13:51
  • @dasjanik I can't comment further without any more context, but like I said you can always round up to the desired precision. – cs95 Jun 04 '19 at 13:52
  • looking at the Dataframe with `.tolist()` reveals that, indeed, the "faulty" floats were hiding there all the time - https://stackoverflow.com/a/47368368 was the answer I didn't know I needed. Even though neither `high` nor `round_trip` precision eliminates every error this brought me to either using decimals or really rounding. Thanks for pointing me at the precision limitation/option with Series. – dasjanik Jun 04 '19 at 14:34