
This seems to be default behaviour in Spark.

In the DB the value is a decimal(18,8), for example:

0.00000000

When Spark reads any decimal value that is zero and has a scale of more than 6 (e.g. 0E-06), it automatically transforms the value to scientific notation.

In this case the value 0.00000000 is automatically converted to 0E-08 within the DataFrame after it is read.

I want to write my DataFrame to CSV, but when writing, Spark writes the 0E-08 value to the CSV, not the decimal 0.00000000.

Is there a way to write the explicit decimal value to CSV, without scientific notation?
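For reference, a minimal sketch that should reproduce the behaviour on Spark 2.4 (the column name amount and the output path are made up for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// A single decimal(18,8) column holding zero, mimicking the DB column
val df = spark.sql("SELECT CAST(0 AS DECIMAL(18,8)) AS amount")

df.show()                    // displays 0E-8 instead of 0.00000000
df.write.csv("/tmp/repro")   // the CSV file also contains 0E-8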

Notes:

  • The app is generic: it takes any table as input and simply writes that table to a CSV file.
  • Therefore the app does not know the schema of the data, nor which columns are decimal values, etc.
  • Each decimal field may have a different precision and scale, so I cannot hardcode these.
  • Using Spark 2.4.8
bp2010
  • [https://stackoverflow.com/questions/55914742/convert-scientific-notation-in-string-format-to-numeric-in-spark-dataframe](https://stackoverflow.com/questions/55914742/convert-scientific-notation-in-string-format-to-numeric-in-spark-dataframe) – Dicer Jun 01 '23 at 02:33
  • @Dicer thanks but this does not answer my question. 1) the field is already Decimal type so does not need to be cast as a Decimal 2) this happens ONLY with zero fields of E-06 or greater. – bp2010 Jun 01 '23 at 03:38
  • You have the decimal data 0.00000000, but after Spark reads the data, it becomes 0E-8. Check this one https://sqlandhadoop.com/pyspark-handle-scientific-number/ – Dicer Jun 01 '23 at 06:55
  • Thanks, that works to format, but each decimal in the input data can have a varying precision/scale, so I need to retrieve the precision/scale from the input data to use generically for all types of decimal input – bp2010 Jun 01 '23 at 15:40

2 Answers

// Any decimal zero value with a scale of 7 or greater (i.e. 0.0000000 or higher scale)
// gets auto-converted into scientific notation: 0E-10, 0E-08, etc.
// The function below ensures that each decimal value keeps its original precision and scale.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, format_number}
import org.apache.spark.sql.types.DecimalType

def ensureDecimalsDoNotConvertToSciNotation(df: DataFrame): DataFrame = {
  df.select(df.columns.map { column =>
    df.schema(column).dataType match {
      case d: DecimalType if d.scale > 6 =>
        // Spark only renders zero decimals with a scale greater than 6 in scientific
        // notation, so re-format those columns, keeping the column's declared scale.
        // Note: format_number returns a string and adds comma grouping separators
        // to integer parts of 1,000 or more.
        format_number(col(column), d.scale).alias(column)
      case _ =>
        col(column)
    }
  }: _*)
}
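For completeness, a minimal usage sketch (the output path below is a placeholder, not from the original post): apply the helper just before writing, so zero decimals are emitted as e.g. 0.00000000 rather than 0E-08.

// Hypothetical usage: df is the DataFrame read from the source table,
// "/path/to/output" is a placeholder output directory
val fixed = ensureDecimalsDoNotConvertToSciNotation(df)
fixed.write.option("header", "true").csv("/path/to/output")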
bp2010

Please try the following code. I am not sure whether this is what you want.

# Create a new column 's', casting the 0E-08 value back to an explicit decimal type
df = df.withColumn("s", df["value"].cast("decimal(18,8)"))
# Save the DataFrame as CSV to the <Your Output> file location
df.write.format("csv").save("<Your Output>")

This is the documentation for DecimalType: https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.types.DecimalType.html

Dicer
  • as mentioned in my comment above, the field is already a decimal type: it is a decimal type in the DB, and after Spark reads it, it is still a decimal type, with the value of 0E-08 – bp2010 Jun 01 '23 at 04:34