I have a DataFrame in Spark that contains a column of coordinates written with a comma as the decimal separator:

df.select("y_wgs84").show
+----------------+
|         y_wgs84|
+----------------+
|47,9882373902965|
|47,9848921211406|
|47,9781530280939|
|47,9731284286555|
|47,9889813907224|
|47,9881440349524|
|47,9744969812356|
|47,9779388492231|
|48,0107946653620|
|48,0161245749621|
|48,0176065577678|
|48,0029496680229|
|48,0061848607139|
|47,9947482295108|
|48,0055828684523|
|48,0148743653486|
|48,0163361315735|
|48,0071490870937|
|48,0178054077099|
|47,8670099558802|
+----------------+

Because the values were read with spark.read.csv(), the column is of type String. Now I want to convert it to a Double as follows:

val format = NumberFormat.getInstance(Locale.GERMANY)
def toDouble: UserDefinedFunction = udf[Double, String](format.parse(_).doubleValue)
df2.withColumn("y_wgs84", toDouble('y_wgs84)).collect

but it fails with java.lang.NumberFormatException: For input string: ".E0". Strangely, grepping the file turns up not a single record containing an E.

Additionally, df.select("y_wgs84").as[String].collect.map(format.parse(_).doubleValue) works just fine. What goes wrong when the function is called as a UDF in Spark?

Georg Heiler
  • Possible duplicate of [What is a NumberFormatException and how can I fix it?](http://stackoverflow.com/questions/39849984/what-is-a-numberformatexception-and-how-can-i-fix-it) – xenteros Apr 11 '17 at 12:14

2 Answers

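Actually, thread safety is the problem. java.text.NumberFormat instances are not synchronized, and the single format instance captured by the UDF is apparently used by several Spark tasks at the same time, whereas the collected version parses the values one after another on the driver. So changing the parsing function to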

def toDouble: UserDefinedFunction = udf[Double, String](_.replace(',', '.').toDouble)

works just fine.
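
If locale-aware parsing is still wanted instead of the string replacement, one option is to give every thread its own formatter. The following is only a minimal sketch under that assumption, not a tested Spark job: the object name GermanNumberParsing and the method parseGerman are hypothetical names chosen for illustration, and the code is plain Scala so it can be tried without a cluster.

```scala
import java.text.NumberFormat
import java.util.Locale

// Hypothetical helper (name chosen for illustration only).
object GermanNumberParsing {
  // NumberFormat instances are not synchronized, so each thread gets
  // its own instance instead of sharing one across concurrent tasks.
  private val germanFormat: ThreadLocal[NumberFormat] =
    new ThreadLocal[NumberFormat] {
      override def initialValue(): NumberFormat =
        NumberFormat.getInstance(Locale.GERMANY)
    }

  // Parses a string such as "47,9882373902965" into a Double.
  def parseGerman(s: String): Double =
    germanFormat.get().parse(s).doubleValue
}
```

In Spark this could then be wrapped in the same way as above, for example udf[Double, String](GermanNumberParsing.parseGerman). Because a Scala object is initialized once per JVM rather than serialized with the closure, each executor thread would lazily create its own formatter.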

Georg Heiler

The character 'E' stands for exponential (scientific) notation, so you won't be able to find it using grep; e.g. 10 million is represented as 1.0E7. A quick Google search suggests it could be a Java bug of some sort (https://community.oracle.com/thread/2349624?db=5). Could you try it in a different environment?

I hope it's not MS Excel magic. Once you open a file in Excel, it tries to be helpful by converting your numbers to exponential notation.

sparker
  • So far several environments have shown the same problem. I was thinking about the thread safety of this operation, but I am not sure whether it could be the source of the problem. I don't think geo coordinates would contain numbers in scientific notation, and besides, it strangely works fine when collected locally. – Georg Heiler Apr 10 '17 at 18:17