
What would be a fast and safe way to convert a String to a numeric type while providing a default value when the conversion fails?

I tried using the usually recommended way, i.e. using Exceptions:

implicit class StringConversion(val s: String) {

  private def toTypeOrElse[T](convert: String=>T, defaultVal: T) = try {
    convert(s)
  } catch {
    case _: NumberFormatException => defaultVal
  }

  def toShortOrElse(defaultVal: Short = 0) = toTypeOrElse[Short](_.toShort, defaultVal)
  def toByteOrElse(defaultVal: Byte = 0) = toTypeOrElse[Byte](_.toByte, defaultVal)
  def toIntOrElse(defaultVal: Int = 0) = toTypeOrElse[Int](_.toInt, defaultVal)
  def toDoubleOrElse(defaultVal: Double = 0D) = toTypeOrElse[Double](_.toDouble, defaultVal)
  def toLongOrElse(defaultVal: Long = 0L) = toTypeOrElse[Long](_.toLong, defaultVal)
  def toFloatOrElse(defaultVal: Float = 0F) = toTypeOrElse[Float](_.toFloat, defaultVal)
}

Using this utility class, I can now easily convert any String to a given numeric type, and provide a default value in case the String does not correctly represent that numeric type:

scala> "123".toIntOrElse()
res1: Int = 123
scala> "abc".toIntOrElse(-1)
res2: Int = -1
scala> "abc".toIntOrElse()
res3: Int = 0
scala> "3.14159".toDoubleOrElse()
res4: Double = 3.14159
...

While it works beautifully, this approach does not seem to scale well when many inputs are invalid, probably because of the cost of throwing and catching exceptions:

scala> for (i<-1 to 10000000) "1234".toIntOrElse()

takes roughly 1 second to execute whereas

scala> for (i<-1 to 10000000) "abcd".toIntOrElse()

takes roughly 1 minute!

I guess another approach would be to avoid relying on exceptions being triggered by the toInt, toDouble, ... methods.

Could this be achieved by checking whether a String "is of the given type"? One could of course iterate through the String's characters and check that they are digits (see e.g. this example), but then what about the other numeric formats (double, float, hex, octal, ...)?
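
The character-scanning idea can be sketched for the `Int` case as follows. This is only a minimal illustration under stated assumptions: the `SafeParse` object and its `toIntOrElse` method are hypothetical names, only ASCII digits with an optional leading sign are accepted, and out-of-range values fall back to the default instead of throwing.

```scala
object SafeParse {
  /** Parses a base-10 Int without throwing.
    * Returns `default` for empty, malformed, or out-of-range input.
    * Assumption: only ASCII digits and one optional leading '+' or '-' are valid.
    */
  def toIntOrElse(s: String, default: Int = 0): Int = {
    val len = s.length
    if (len == 0) return default
    var i = 0
    var negative = false
    val first = s.charAt(0)
    if (first == '-' || first == '+') {
      negative = first == '-'
      i = 1
      if (len == 1) return default // a lone sign is not a number
    }
    var acc = 0L // accumulate in a Long so Int overflow is easy to detect
    while (i < len) {
      val c = s.charAt(i)
      if (c < '0' || c > '9') return default // non-digit: give up immediately
      acc = acc * 10 + (c - '0')
      if (acc > Int.MaxValue.toLong + 1) return default // magnitude exceeds |Int.MinValue|
      i += 1
    }
    val signed = if (negative) -acc else acc
    if (signed > Int.MaxValue) default else signed.toInt
  }
}
```

With this sketch, `SafeParse.toIntOrElse("abcd", -1)` returns the default on the first character and never constructs an exception. It does not cover the other formats mentioned above (floating point, hex, octal), which would each need their own scanner.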

borck
  • Regex is probably your best way to go here if you want to avoid the overhead of try/catch completely. You just need to come up with a regex for each of the numeric types you want to be able to convert from. But honestly, that is probably a premature optimization. How fast does this code need to be? How often is it hit? How often will it get invalid numbers, thus hitting the catch block? These are questions you need to ask yourself before optimizing, as the code gets a bit more complex. – cmbaxter May 08 '14 at 12:39
  • @cmbaxter I agree with you but I'm using this in a Big Data context, where I parse huge CSV files (Billions of rows), so it matters. – borck May 08 '14 at 14:19
  • Fair enough. Then I would go with Regex to vet the string first. Will be much faster. – cmbaxter May 08 '14 at 14:22
  • @pbr consider the updated answer, which uses an enriched set of characters that may belong to a numeric value; to avoid a performance penalty, no specialised parsing is done. This may prove helpful for filtering out most non-numeric values. – elm May 08 '14 at 17:14
  • @pbr consider also http://stackoverflow.com/a/16699049/3189923 (and Apache Commons Lang). – elm May 08 '14 at 20:53

1 Answer

As a first approach, filter out those input strings that do not contain any digit:

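private def toTypeOrElse[T](convert: String => T, defaultVal: T) = try {
  // Cheap pre-check: skip the conversion when the string has no ASCII digit at all.
  // (String.contains looks for a literal substring, not a regex, so test the characters directly.)
  if (s.exists(c => c >= '0' && c <= '9')) convert(s)
  else defaultVal
} catch {
  // Strings that contain a digit but are still malformed end up here
  case _: NumberFormatException => defaultVal
}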

Update

An enriched set of characters that may occur in a numeric value; neither the order of occurrence nor limits on repetition are checked:

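private def toTypeOrElse[T](convert: String => T, defaultVal: T) = try {
  // Only attempt the conversion when every character could belong to a number
  if (s matches "[\\+\\-0-9.e]+") convert(s)
  else defaultVal
} catch {
  // Strings such as "1.2.3" or "--" pass the filter and are still caught here
  case _: NumberFormatException => defaultVal
}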
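
One possible refinement, sketched here under stated assumptions: `String.matches` compiles its pattern on every call, so for billions of rows it may be worth compiling the patterns once and making them strict enough that the catch block is rarely reached. The `NumericVetting` object, its method names, and both patterns below are hypothetical. The patterns are deliberately narrower than what `toInt` and `toDouble` accept (no hex, no `NaN` or `Infinity`, no type suffixes, no surrounding whitespace).

```scala
import java.util.regex.Pattern

object NumericVetting {
  // Hypothetical patterns, compiled once and reused for every input.
  // Up to 10 digits with an optional sign; overflow is still possible and handled below.
  private val IntPattern: Pattern = Pattern.compile("[+-]?[0-9]{1,10}")
  // Plain decimal notation with an optional exponent, e.g. "3.14", ".5", "1e3".
  private val DoublePattern: Pattern =
    Pattern.compile("[+-]?([0-9]+\\.?[0-9]*|\\.[0-9]+)([eE][+-]?[0-9]+)?")

  // Vet the string first; keep the try/catch only as a safety net for rare cases
  // such as a 10-digit value that overflows Int.
  private def vetted[T](p: Pattern, s: String, default: T)(convert: String => T): T =
    if (!p.matcher(s).matches()) default
    else try convert(s) catch { case _: NumberFormatException => default }

  def toIntOrElse(s: String, default: Int = 0): Int =
    vetted(IntPattern, s, default)(_.toInt)

  def toDoubleOrElse(s: String, default: Double = 0d): Double =
    vetted(DoublePattern, s, default)(_.toDouble)
}
```

For example, `NumericVetting.toIntOrElse("abcd", -1)` is rejected by the pattern and returns the default without any exception being thrown. Whether this is actually faster than the character-class filter above would need to be measured on representative data.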
Pratik Khadloya
elm