
Using Apache Spark 2.0.2, I have a table stored as Parquet which contains about 23 million rows and about 300 columns. There is a column called total_price stored as double. If I execute:

select sum(total_price) from my_table;

+-----------------+
|   total_price   |
+-----------------+
|   9.3923769592E8|
+-----------------+

So this number, 9.3923769592E8, seems wrong.

But if I execute:

select year, sum(total_price) from my_table group by year;

+-------+------------------------+
|   year|             total_price|
+-------+------------------------+
|   2017|  44510982.10004025     |
|   2016| 293320440.63992333     |
|   2015| 311512575.890131       |
|   2014| 289885757.2799143      |
|   2013|      5192.319          |
|   2012|      2747.7000000000007|
+-------+------------------------+

My assumption is that in the first query the double data type overflows, or something like that.

  • Why am I getting results with so many decimal places after the dot if the values are stored as #.##?

  • How can I fix the error in the first query?

1 Answer


The value you get looks just fine: 9.3923769592E8 is roughly 939,237,695, which is more or less what you would expect based on the numbers aggregated by year.
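As a quick sanity check (simply re-adding the per-year figures from your second query in plain Scala), the yearly sums do add up to roughly that total, and Double's default toString also explains where the E8 notation comes from:

object SumCheck extends App {
  // Per-year sums copied from the question
  val yearlySums = Seq(
    44510982.10004025,     // 2017
    293320440.63992333,    // 2016
    311512575.890131,      // 2015
    289885757.2799143,     // 2014
    5192.319,              // 2013
    2747.7000000000007     // 2012
  )
  // Double.toString switches to scientific notation for values this large,
  // so this prints roughly 9.3923769593E8, i.e. about 939,237,695.93
  println(yearlySums.sum)
}

So the E8 suffix is just how a large Double is printed, not a sign of overflow.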

Regarding the values you get, you have to remember that only some numbers are exactly representable in floating-point arithmetic, and the commonly used types, like Scala's Double or Float, are not suitable for use cases where exact values are necessary (accounting, for example). For applications like this you should use DecimalType.
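A minimal sketch of what that could look like with the DataFrame API (assuming a SparkSession named spark is available; DecimalType(18, 2) is an assumed precision and scale, pick values that fit your data):

import org.apache.spark.sql.functions.{col, sum}
import org.apache.spark.sql.types.DecimalType

// Cast to a decimal type before aggregating so the sum is exact
// (no binary rounding). DecimalType(18, 2) is an assumption here.
val exactSum = spark.table("my_table")
  .select(sum(col("total_price").cast(DecimalType(18, 2))).alias("total_price"))

exactSum.show()

The plain SQL equivalent is sum(CAST(total_price AS DECIMAL(18,2))). This also answers your first question: a decimal sum (or format_number on the result) gets rid of both the long fractional tails and the scientific notation in the output.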

I would also recommend reading What Every Computer Scientist Should Know About Floating-Point Arithmetic and Is floating point math broken?
