
I am having a little trouble deciding the best way to do this. Currently I have the following:

val df: DataFrame = ... // just take as given

val intrdd = df.rdd.map(row => processRow(row, df))

def processRow(row: Row, statedf: DataFrame) : Row = {

  val rowvaluesmap = row.getValuesMap[Any](row.schema.fieldNames)
  val imputedvals = rowvaluesmap.transform((key, value) => imputeNulls(key, value, rowvaluesmap, statedf))
  val rowfinal = Row.fromSeq(imputedvals.values.toSeq)

  return rowfinal
}

def imputeNulls(key: String, value: Any, rowvaluesmap: Map[String, _], statedf: DataFrame): Any = {

      val rowvaluesmapnotnull = rowvaluesmap.filter((t) => t._2 != null)
      val rowvaluesmapnull = rowvaluesmap.filter((t) => t._2 == null)


       // keys changed for privacy
      if (value != null) {
        return value
      } else if (value == null && (key == "x" | key == "y" | key == "z")) {
        val imputedval = imputeNullValueString(key, value, rowvaluesmapnotnull, statedf)
        return imputedval
      } else if (value == null && (key == "1" | key == "2" | key == "3")) {
        val imputedval = imputeNullValueInt(key, value, rowvaluesmapnotnull, statedf)
        return imputedval
      } else if (value == null && (key == "a" | key == "b" | key == "c" | key == "d")) {
        val imputedval = imputeNullValueShort(key, value, rowvaluesmapnotnull, statedf)
        return imputedval
      } else if (value == null && (key == "z" | key == "r" | key == "w" | key == "q")) {
        val imputedval = imputeNullValueFloat(key, value, rowvaluesmapnotnull, statedf)
        return imputedval
      } else {
        return null
      }

    }

where imputeNullValueX returns a value in the appropriate format.
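One way to tighten the dispatch is to group the keys into sets and pattern-match on them, so the `value == null` check happens once. The sketch below uses placeholder key names and stub imputation functions (the real keys were changed for privacy, and the real `imputeNullValueX` helpers depend on the DataFrame), so treat it as a shape, not a drop-in replacement:

```scala
// Hypothetical key groupings -- placeholders standing in for the
// privacy-renamed keys from the question.
val stringKeys = Set("x", "y", "z")
val intKeys    = Set("1", "2", "3")
val shortKeys  = Set("a", "b", "c", "d")
val floatKeys  = Set("r", "w", "q")

// Stubs standing in for the question's imputeNullValueX helpers.
def imputeString(key: String): Any = "imputed"
def imputeInt(key: String): Any    = 0
def imputeShort(key: String): Any  = 0.toShort
def imputeFloat(key: String): Any  = 0.0f

def imputeNulls(key: String, value: Any): Any =
  if (value != null) value // non-null values pass through unchanged
  else key match {
    case k if stringKeys(k) => imputeString(k)
    case k if intKeys(k)    => imputeInt(k)
    case k if shortKeys(k)  => imputeShort(k)
    case k if floatKeys(k)  => imputeFloat(k)
    case _                  => null
  }
```

The null check happens exactly once, each branch handles only its key set, and adding a new type means adding one set and one case.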

I am sure there is a better way to do this. Is there an optimal way to do this? I suspect that rebuilding the row with

  val rowfinal = Row.fromSeq(imputedvals.values.toSeq)

  return rowfinal

is what's screwing things up.
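One concrete thing that can go wrong there: `Map.values` does not guarantee iteration order for maps with more than a handful of entries, so `imputedvals.values.toSeq` may not match the schema's column order. A safer pattern is to index the map by the schema's field names. The sketch below simulates this without Spark (in the real code, `fieldNames` would be `row.schema.fieldNames` and the result would go to `Row.fromSeq`); the field names and values are made-up examples:

```scala
// Field order as the schema defines it
// (in Spark: row.schema.fieldNames).
val fieldNames = Seq("x", "y", "1", "2", "a", "b", "c")

// A row's values keyed by field name
// (in Spark: row.getValuesMap[Any](row.schema.fieldNames), after imputation).
val imputedVals: Map[String, Any] =
  Map("x" -> "sx", "y" -> "sy", "1" -> 1, "2" -> 2,
      "a" -> 10, "b" -> 11, "c" -> 12)

// Risky: imputedVals.values.toSeq -- ordering not guaranteed.
// Safer: look each field up in schema order.
val orderedValues: Seq[Any] = fieldNames.map(imputedVals)
// In Spark you would then call Row.fromSeq(orderedValues).
```

Separately, note that passing `df` into `processRow` inside `df.rdd.map` references a DataFrame from within an executor-side closure, which is a classic source of NullPointerExceptions in Spark (as one of the comments below suggests); precomputing whatever statistics you need into a plain collected Map, or a broadcast variable, and closing over that instead is the usual workaround.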

Thanks.

user48944
  • define "screwing things up"? What's the problem? And can you explain what you're trying to achieve? – Tzach Zohar Jan 22 '18 at 22:59
  • What is even the point of that code? You repeat each condition twice and `if (value == null && (key == "x" | key == "y" | key == "z"))` is equivalent to `if (value == null && (key == "x" | key == "y" | key == "z" | key == "z"))`. – Alper t. Turker Jan 23 '18 at 00:25
  • It would be easier to help you if you actually explained what you're trying to do. It is not clear from just looking at the code. – Roberto Congiu Jan 23 '18 at 00:46
  • Sorry the keys are distinct, I just didn't make them so. I am trying to map a function to each row that then maps a function to each element on the row if the element is null. This will return either a value (that should match the row) or null. Basically I am trying to do a greedy dynamic search. I look at each null value and try to find the column that best predicts that value given the non-null value of the same row. – user48944 Jan 23 '18 at 00:57
  • My guess is the problem is [Why does this Spark code make NullPointerException?](https://stackoverflow.com/q/47111607/8371915) – Alper t. Turker Jan 23 '18 at 01:09

0 Answers