I need to write a couple of floats to a text file and store a CRC32 checksum with them. Then when I read the floats back from the text file, I want to recompute the checksum and compare it to the one that was previously computed when saving the file. My problem is that the checksum sometimes fails. This is due to the fact that equal floating point numbers can be represented by different bit patterns. For completeness' sake, I will summarize the code in the next paragraphs.

I have adapted this CRC32 algorithm which I found after reading this question. Here's what it looks like:

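// Advance the CRC by one octet using the precomputed lookup table.
uint32_t updC32(uint32_t octet, uint32_t crc) {
    return CRC32Tab[(crc ^ octet) & 0xFF] ^ (crc >> 8);
}

// Feed the object representation (the raw bytes) of s into the CRC.
template <typename T>
uint32_t updateCRC32(T s, uint32_t crc) {
    // unsigned char avoids sign extension when a byte is widened to uint32_t
    const unsigned char* buf = reinterpret_cast<const unsigned char*>(&s);
    size_t len = sizeof(T);

    for (; len; --len, ++buf)
        crc = updC32(static_cast<uint32_t>(*buf), crc);
    return crc;
}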

CRC32Tab contains exactly the same values as the large array in the file linked above.

This is an abbreviated version of how I write the floats to a file and compute the checksum:

float x, y, z;

// set them to some values

uint32_t crc = 0xFFFFFFFF;
crc = Utility::updateCRC32(x, crc);
crc = Utility::updateCRC32(y, crc);
crc = Utility::updateCRC32(z, crc);
const uint32_t actualCrc = ~crc;

// stream is a FILE pointer, and I don't mind the scientific representation
fprintf(stream, " ( %g %g %g )", x, y, z);
fprintf(stream, " CRC %u\n", actualCrc);

I read the values back from the file as follows. There is actually a lot more involved as the file has a more complex syntax and has to be parsed, but let's assume that getNextFloat() returns the textual representation of each float written before.

float x = std::atof(getNextFloat());
float y = std::atof(getNextFloat());
float z = std::atof(getNextFloat());

uint32_t crc = 0xFFFFFFFF;
crc = Utility::updateCRC32(x, crc);
crc = Utility::updateCRC32(y, crc);
crc = Utility::updateCRC32(z, crc);
const uint32_t actualCrc = ~crc;

const uint32_t fileCrc = // read the CRC from the file
assert(fileCrc == actualCrc); // fails often, but not always

The source of this problem seems to be that std::atof returns a float whose bit representation differs from that of the float which was originally written to the file as a string.

So, my question is: Is there another way to achieve my goal of checksumming floats which are roundtripped through a textual representation other than to checksum the strings themselves?

Thanks for reading!

Kristian Duske
  • Why don't you check your assumption by outputting the bit pattern? – PlasmaHH Mar 15 '13 at 10:29
  • You are right, I should have done that right away. I have now, and the bit patterns don't match. I will remove that part of the question. – Kristian Duske Mar 15 '13 at 10:33
  • IEEE 754 binary floating-point values do not have multiple representations except for zero (+0 and -0) and NaNs. If converting any other floating-point value to a numeral (including “infinity”) and then back to a floating-point value does not produce the same bits as the original value, then the conversions were performed inaccurately. – Eric Postpischil Mar 15 '13 at 14:08

5 Answers

If the text file doesn't have to be human-readable, use hexadecimal float literals instead. They are exact, so you won't run into differences between the textual and in-memory values.
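
A minimal sketch of such a round trip, assuming a C++11 standard library whose `snprintf` supports `%a` and whose `std::strtof` accepts hexadecimal floats. The helper names `toHexText`, `fromHexText` and `sameBits` are made up for illustration.

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical helper: format a float as a hexadecimal floating-point literal.
std::string toHexText(float value) {
    char buf[64];
    // The float is promoted to double; %a prints that binary value exactly.
    std::snprintf(buf, sizeof buf, "%a", value);
    return buf;
}

// Hypothetical helper: parse a hexadecimal floating-point literal back into a float.
float fromHexText(const std::string& text) {
    return std::strtof(text.c_str(), nullptr);
}

// True if both floats have identical bit patterns.
bool sameBits(float a, float b) {
    return std::memcmp(&a, &b, sizeof a) == 0;
}
```

With these helpers, `sameBits(fromHexText(toHexText(x)), x)` should hold for finite `x`, so a checksum over the bytes of the parsed value would match the one computed before writing.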

unwind
  • Thanks for your suggestion, but it has to be both human-readable and readable by other programs. – Kristian Duske Mar 15 '13 at 10:42
  • @KristianDuske: Maybe you just need to print with more digits; use `std::numeric_limits<T>::max_digits10` digits to get everything needed. I think standard printf only prints 6 or so. – PlasmaHH Mar 15 '13 at 10:45
  • @PlasmaHH: you might need to add 1 (maybe even 2) to that digits10 number. – sellibitze Mar 15 '13 at 10:47
  • Those digits would just be 0 (in case of non-scientific representation), wouldn't they? This is not a rounding issue I think. – Kristian Duske Mar 15 '13 at 10:51
  • @MSalters: Are you saying that it's impossible to create the text "-0" for -0 and "0" for "+0"? The standard may not guarantee that their float-to-text conversion includes the sign even for zero. But it should not be too hard to attach a minus sign in front of the text. – sellibitze Mar 15 '13 at 10:51
  • @sellibitze: You'd also have to write your own `atof`, as that too is not guaranteed to parse `"-0"` as `-0.0` – MSalters Mar 15 '13 at 10:53
  • @sellibitze: No need to add 1 to that number, it is specifically designed to be the correct number for exactly that purpose. – PlasmaHH Mar 15 '13 at 10:56
  • @KristianDuske: What makes you think max_digits10 will be 0? It is 9,17,21 on my system for the common floating types, and printf normally only outputs 6 per default. Just give it a try. – PlasmaHH Mar 15 '13 at 10:59
  • @PlasmaHH: My implementation (GCC) yields 15 for double, but you actually need 17 decimal digits to make it lossless in every case. So either the implementation is non-conforming or you are wrong. ;) – sellibitze Mar 15 '13 at 10:59
  • @sellibitze: My implementation yields 17 for double. What implementation are you using, and are you sure that you don't use `digits10` instead of `max_digits10`? Don't trust me, read 18.3.2.4-13: "Number of base 10 digits required to ensure that values which differ are always differentiated." – PlasmaHH Mar 15 '13 at 11:02
  • @PlasmaHH: G++ 4.6.1, digits10: 15, max_digits10: 17. Sorry, I did not notice that you wrote "max_digits10". I only knew about "digits10". Seems like that's another C++11 thing I did not know about before. Thanks for pointing it out! – sellibitze Mar 15 '13 at 11:05

The source of the issue is apparent from your comment:

If I'm not completely mistaken, there is no rounding happening here. The %g specifier chooses the shortest string representation that exactly represents the number.

This is incorrect. If no precision is specified, it defaults to 6, and rounding will definitely occur for most floating-point inputs.

If you need a human-readable round-trippable format, %a is by far the best choice. Failing that, you will need to specify a precision of at least 9 (assuming that float on your system is IEEE-754 single precision).

You may still be tripped up by NaN encodings, since the standard does not specify how or if they must be printed.
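
To make the difference between the default precision and a precision of 9 concrete, here is a small sketch. It assumes IEEE-754 single-precision `float` and correctly rounded library conversions; `formatFloat` and `roundTrips` are hypothetical helpers, not part of the question's code.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper: format a float with the given number of significant digits.
std::string formatFloat(float value, int precision) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*g", precision, value);
    return buf;
}

// Hypothetical helper: true if value survives float -> text -> float unchanged.
bool roundTrips(float value, int precision) {
    const std::string text = formatFloat(value, precision);
    return std::strtof(text.c_str(), nullptr) == value;
}
```

Under those assumptions, `roundTrips(1.0f / 3.0f, 6)` should be false, because the six-digit text parses to a different float, while `roundTrips(1.0f / 3.0f, 9)` should be true.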

Stephen Canon
  • Thank you. I realize that my code has several issues: First, it doesn't write the exact numbers to the file, because they get rounded by %g. Second, checksumming the bit representation of the floats will yield non-matching checksums for such numbers that can be represented by several bit patterns. I suppose my best bet is to checksum the string representations instead. – Kristian Duske Mar 15 '13 at 11:11
  • One more question: I cannot use %a because other programs that read this file cannot deal with it. Is %f a good option? – Kristian Duske Mar 15 '13 at 11:13
  • Disregard that last question. I need to specify the precision manually. – Kristian Duske Mar 15 '13 at 11:15

If your standard library's float-to-text and text-to-float conversions round correctly, you just need enough significant digits for the float->text->float round trip to be lossless. NaNs are a separate matter: the round trip should still be "value-preserving", but not necessarily bit-pattern-preserving, since there are multiple representations for NaN, I think. For an IEEE-754 64-bit double, 17 significant digits are just enough to make the round trip lossless with respect to the actual value.
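
The same idea can be written generically with C++11's `std::numeric_limits<T>::max_digits10`, which is 17 for an IEEE-754 double and 9 for an IEEE-754 float. This is only a sketch using iostreams; `toRoundTripText` and `fromText` are hypothetical names, and it assumes the library's conversions round correctly.

```cpp
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

// Hypothetical helper: write a value with max_digits10 significant digits.
template <typename T>
std::string toRoundTripText(T value) {
    std::ostringstream out;
    out << std::setprecision(std::numeric_limits<T>::max_digits10) << value;
    return out.str();
}

// Hypothetical helper: parse a value of type T from its decimal text.
template <typename T>
T fromText(const std::string& text) {
    std::istringstream in(text);
    T value{};
    in >> value;
    return value;
}
```

For finite values, `fromText<T>(toRoundTripText(v)) == v` is expected to hold for both `float` and `double`.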

sellibitze
  • If I'm not completely mistaken, there is no rounding happening here. The %g specifier chooses the shortest string representation that exactly represents the number. I think the source of the problem is that the bit pattern of the number written to the file is different from the bit pattern of the number after it was parsed from the file. – Kristian Duske Mar 15 '13 at 10:49
  • @KristianDuske: What do you mean by "exactly represents the number"? If you want to *exactly* express the value of a float in decimal, you need *many* more digits than the number of digits you get. However, it may be accurate enough. But it seems it's not even that. Try to increase the number of significant digits so that the mapping from float to text will be injective. – sellibitze Mar 15 '13 at 10:57
  • @KristianDuske: No, that is not how %g works. It prints 6 significant digits per default. – PlasmaHH Mar 15 '13 at 11:03
  • @KristianDuske: How is that enough?! – sellibitze Mar 15 '13 at 11:06
  • I didn't know that %g resulted in rounding. I thought that it would be exact. I will specify the precision when writing the floats manually. – Kristian Duske Mar 15 '13 at 11:16
  • @KristianDuske: There is almost always rounding going on simply because the number systems are different. For example, try expressing 0.4 in binary. Try expressing 1+1/512 *exactly* in decimal... – sellibitze Mar 15 '13 at 11:36

Your CRC algorithm is flawed for any type which has multiple binary representations for a single value. IEEE 754 has two representations for 0.0, to wit +0.0 and -0.0. Other, non-finite values such as NaN are potentially troublesome too.
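
A short sketch of the zero case, assuming a 32-bit IEEE-754 `float`. `bitsOf` and `normalizeForChecksum` are hypothetical helpers. Collapsing every NaN into one quiet NaN is just one possible policy, not something the question prescribes.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper: copy the float's bytes into an integer to inspect them.
std::uint32_t bitsOf(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "assumes 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}

// Hypothetical helper: give values that compare equal a single bit pattern
// before their bytes are fed into the CRC.
float normalizeForChecksum(float value) {
    if (value != value)  // NaN: pick one canonical encoding (an assumed policy)
        return std::numeric_limits<float>::quiet_NaN();
    if (value == 0.0f)   // true for both +0.0 and -0.0
        return 0.0f;
    return value;
}
```

Here `0.0f == -0.0f` is true while `bitsOf(0.0f)` and `bitsOf(-0.0f)` differ, which is exactly the situation in which a checksum over the raw bytes disagrees for equal values.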

MSalters

Would it be acceptable to canonicalize your numbers before you update the CRC? So while saving, you would get a temporary string version of your number (with sprintf or whatever matches your serialization's format), then convert this string back to a numeric value, and then use this result to update the CRC. This way, you know that the CRC will match the deserialized value.
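
As a rough sketch, assuming the file keeps using `%g` for writing and `std::atof` for reading as in the question, the canonicalization step might look like this. `canonicalizeViaText` is a hypothetical name.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: pass a float through the same text format used in the
// file and return the value a reader would get back from parsing that text.
float canonicalizeViaText(float value) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%g", value);  // same format as the writer
    return static_cast<float>(std::atof(buf));    // same parsing as the reader
}
```

The writer would then call something like `crc = Utility::updateCRC32(canonicalizeViaText(x), crc);` instead of checksumming `x` directly. Note that signed zeros and NaNs would still need separate handling.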

Christopher Oicles
  • Thanks for your answer. I don't think this is necessary if I specify the precision manually (I need the strings to be exact, but didn't know that %g performed rounding). I think I will be better off by running the CRC32 on the string representations instead. – Kristian Duske Mar 15 '13 at 11:18
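
The approach settled on in the comments, running the CRC32 on the string representations, could be sketched as follows. This is an illustration only: it assumes that the question's CRC32Tab is the table of the common reflected CRC-32 (polynomial 0xEDB88320), and `makeCrc32Table`, `crc32OfText` and `formatTriple` are hypothetical names.

```cpp
#include <array>
#include <cstdint>
#include <cstdio>
#include <string>

// Build the lookup table for the common reflected CRC-32 (polynomial 0xEDB88320).
// Assumption: this matches the CRC32Tab used in the question.
std::array<std::uint32_t, 256> makeCrc32Table() {
    std::array<std::uint32_t, 256> table{};
    for (std::uint32_t i = 0; i < 256; ++i) {
        std::uint32_t c = i;
        for (int bit = 0; bit < 8; ++bit)
            c = (c & 1u) ? (0xEDB88320u ^ (c >> 1)) : (c >> 1);
        table[i] = c;
    }
    return table;
}

// CRC-32 over the bytes of a string.
std::uint32_t crc32OfText(const std::string& text) {
    static const std::array<std::uint32_t, 256> table = makeCrc32Table();
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char octet : text)
        crc = table[(crc ^ octet) & 0xFFu] ^ (crc >> 8);
    return ~crc;
}

// Hypothetical helper: the exact text written to the file for one triple,
// using 9 significant digits so that IEEE-754 floats survive the round trip.
std::string formatTriple(float x, float y, float z) {
    char buf[128];
    std::snprintf(buf, sizeof buf, " ( %.9g %.9g %.9g )", x, y, z);
    return buf;
}
```

The writer would compute `crc32OfText(formatTriple(x, y, z))` and store it next to the text. The reader would run `crc32OfText` on the same characters as read from the file, so the comparison no longer depends on how the floats are parsed.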