
This is a non-pandas scenario, asked out of curiosity. In pyspark I can generate a column value that is the original value concatenated with the relevant column name, e.g. a solution I provided: Appending column name to column value using Spark.

Now consider the following:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = sc.parallelize(Seq(
    ("r1", 0.0, 0.0, 0.0, 0.0),
    ("r2", 6.4, 4.9, 6.3, 7.1),
    ("r3", 4.2, 0.0, 7.2, 8.4),
    ("r4", 1.0, 2.0, 0.0, 0.0)
)).toDF("ID", "aa1a", "bb3", "ccc4", "d1ddd")

// For every column except ID, emit 1 when the value is 0.0 (else 0), then sum these per row.
val count_zero = df.columns.tail.map(x => when(col(x) === 0.0, 1).otherwise(0)).reduce(_ + _)

df.withColumn("zero_count", count_zero).show(false)

So, what if, for argument's sake only,

  • I also wanted to check that the column name itself contains a '1' somewhere, as an extra condition for adding the 1, and

  • I wanted this check inside the when used to build count_zero?

I am not interested in generating column lists or sequences to process.

As stated, this is for argument's sake. I cannot find a way to check the column name in Scala within a when for a DataFrame.
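
To pin down the intended result, here is a minimal plain-Scala sketch (no Spark involved) of the rule described above: a cell contributes 1 only when its value is 0.0 and its column name contains "1". The names valueCols, rows, zeroCountWithNameCheck and expected are hypothetical and exist only for this illustration.

```scala
// Plain-Scala model of the desired rule; this is not Spark code.
val valueCols = Seq("aa1a", "bb3", "ccc4", "d1ddd")

val rows = Seq(
  ("r1", Seq(0.0, 0.0, 0.0, 0.0)),
  ("r2", Seq(6.4, 4.9, 6.3, 7.1)),
  ("r3", Seq(4.2, 0.0, 7.2, 8.4)),
  ("r4", Seq(1.0, 2.0, 0.0, 0.0))
)

// Count cells that are 0.0 AND sit in a column whose name contains "1".
def zeroCountWithNameCheck(values: Seq[Double]): Int =
  valueCols.zip(values).count { case (name, v) => name.contains("1") && v == 0.0 }

// Pair each row ID with its count under the rule.
val expected = rows.map { case (id, values) => id -> zeroCountWithNameCheck(values) }
```

Under this rule only aa1a and d1ddd can contribute, so r1 counts 2, r4 counts 1, and r2 and r3 count 0.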

thebluephantom

1 Answer

If I understand your requirement correctly, you could turn each column name into a Column with lit and add the extra condition using the contains method:

val count_zero = df.columns.tail.map(x =>
    when(lit(x).contains("1") && col(x) === 0.0, 1).otherwise(0)
  ).
  reduce(_ + _)
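
Because the column names are already known on the driver while the expression is being built, the name test can also be evaluated in plain Scala and passed in as a Boolean literal. The following is a sketch of that alternative, not the method shown above. It rebuilds the question's DataFrame so that it is self-contained, assumes a local SparkSession, and uses the hypothetical name countZeroDriverSide.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}

// Assumption: a local session; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("zero-count-sketch")
  .getOrCreate()
import spark.implicits._

// Same data as in the question.
val df = Seq(
  ("r1", 0.0, 0.0, 0.0, 0.0),
  ("r2", 6.4, 4.9, 6.3, 7.1),
  ("r3", 4.2, 0.0, 7.2, 8.4),
  ("r4", 1.0, 2.0, 0.0, 0.0)
).toDF("ID", "aa1a", "bb3", "ccc4", "d1ddd")

// x.contains("1") runs in Scala on the driver while the expression is built;
// lit(...) wraps the resulting Boolean as a constant Column inside the when.
val countZeroDriverSide = df.columns.tail.map { x =>
  when(lit(x.contains("1")) && col(x) === 0.0, 1).otherwise(0)
}.reduce(_ + _)

df.withColumn("zero_count", countZeroDriverSide).show(false)
```

Both forms keep the name check inside the when, as the question asks. The difference is only where the string test is evaluated: here once per column on the driver, in the answer as a Column expression on a string literal.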
Leo C