5

I've noticed that when certain Numpy float64 values are saved as Excel file (via a Pandas DataFrame), they get changed. First I thought this has to do with some imprecision in Excel, but Excel seems to encode floating point numbers as double precision, so I am a bit confused about this observation.

>>> import numpy as np
>>> import pandas as pd

# Create a floating point number that exhibits the problem.
>>> ba = bytearray(['\x53', '\x2a', '\xb0', '\x49', '\xf3', '\x79', '\x90', '\x40'])
>>> ba
bytearray(b'S*\xb0I\xf3y\x90@')
>>> f = np.frombuffer(ba)
>>> f[0]
1054.4875857854684

# Write to dataframe to save as Excel file.
>>> df = pd.DataFrame({'a': f})
>>> df.to_excel('test.xlsx', engine='xlsxwriter')

# Read excel file (when viewing the file in LibreOffice, the 
# value isn't 1054.4875857854684 any more).
>>> df2 = pd.read_excel('test.xlsx')
>>> df2.ix[0,'a']
1054.4875857854699
>>> df2.ix[0,'a'] == f[0]
False

Why is it not possible to read the same float64 back from Excel that was previously written?

I also tried this with Openpyxl (.xlsx format) and Xlwt (.xls format) as engines. While the former produced the same erroneous result as xlsxwriter, Xlwt was actually working as expected and wrote the float according to the exact variable value. Is there perhaps a parameter that I miss for the .xlsx format writer engines?

# this uses the xlwt engine
>>> df.to_excel('test.xls')
>>> df2 = pd.read_excel('test.xls')
>>> df2.ix[0,'a'] == f[0]
True
orange
  • 7,755
  • 14
  • 75
  • 139

2 Answers2

1

I also tried this with Openpyxl (.xlsx format) and Xlwt (.xls format) as engines. While the former produced the same erroneous result as xlsxwriter, Xlwt was actually working as expected and wrote the float according to the exact variable value.

The difference is that .xls is a binary file format and the 64 bit representation of the the IEEE 754 double is written exactly to the file and can be read back to the same 64 bits.

The .xlsx file format however is a collection of text XML files in a zip container. As such doubles are written as a string representation of the double (using a format like '%.16g') and read in by converting that string representation back to a double. That is essentially a lossey process for doubles since there isn't an exact string representation for a vast majority of IEEE 754 numbers.

For example if you take the numpy number in your example and format it with different precisions you will get different representations:

>>> '%.16g' % f[0]
'1054.487585785468'

>>> '%.17g' % f[0]
'1054.4875857854684'

>>> '%.18g' % f[0]
'1054.48758578546835'

You can also demonstrate this yourself by pasting 1054.4875857854684 into a cell in Excel, saving the file and examining the output:

So for a file like this:

enter image description here

You would get something like this:

$ unzip numpy.xlsx -d numpy

$ xmllint --format numpy/xl/worksheets/sheet1.xml | grep 1054
        <v>1054.4875857854599</v>

This is more or less what you are seeing when you read the file back in using Pandas.

jmcnamara
  • 38,196
  • 6
  • 90
  • 108
  • In the implementation I looked at you leave the float formatting part in `XlsxWriter` undefined though. `attr += ' %s="%s"' % (key, value)` uses god knows what kind of formatting precision... – orange Jul 02 '17 at 02:34
  • That code is for the XML attribute(s). The double value is formatted with `%.16g`. From experimentation this matches Excel (although for some numbers it seems to use `%.17g`). I have no way of knowing but Excel probably uses a dedicated function for writing doubles rather than `sprintf()`. It would make sense from a performance point of view at least. Either way, this is at the edge of the precision of a double so some loss is inevitable. If you are interested you should create numbers in Excel and look at the representation in the output XML. – jmcnamara Jul 02 '17 at 07:24
0

After some digging in Pandas and XlsxWriter, I essentially found two conversion steps from numpy.float64 to the .xlsx file:

1) numpy.float64 => float (no loss of fidelity) in pandas/io/excel.py

def _conv_value(val):
    # Convert numpy types to Python types for the Excel writers.
    if com.is_integer(val):
        val = int(val)
    elif com.is_float(val):
        val = float(val)
    elif com.is_bool(val):
        val = bool(val)
    elif isinstance(val, Period):
        val = "%s" % val
    elif com.is_list_like(val):
        val = str(val)

    return val

2) float => string (attr += ' %s="%s"' % (key, value)). this is where precision is changed (in xlswriter/xmlwriter.py)

def _xml_number_element(self, number, attributes=[]):
    # Optimised tag writer for <c> cell number elements in the inner loop.
    attr = ''

    for key, value in attributes:
        value = self._escape_attributes(value)
        attr += ' %s="%s"' % (key, value)

    self.fh.write("""<c%s><v>%.15g</v></c>""" % (attr, number))

So the serialisation (step 2) is where the the precision is changed. I guess, since xls is a binary format, the float would be directly written, without conversion.

orange
  • 7,755
  • 14
  • 75
  • 139