0

I want to change the nullable property of a particular column in a Spark Dataframe.

If I print schema of the dataframe currently it looks like below.

col1: string (nullable = false)
col2: string (nullable = true)
col3: string (nullable = false)
col4: float (nullable = true)

I just want col3 nullable property to be updated.

col1: string (nullable = false)
col2: string (nullable = true)
col3: string (nullable = true)
col4: float (nullable = true)

I checked online here are some links, but seems like they are doing it for all the columns but not to a specific column, see Change nullable property of column in spark dataframe. Can any one please help me in this regard?

Shaido
  • 27,497
  • 23
  • 70
  • 73
Ramesh
  • 1,563
  • 9
  • 25
  • 39
  • I am performing some action on one column by using .withColumn. After that i performed a printSchema then in the schema i observed that nullable is getting updated as nullable = true by default. How can i control that nullable value(true or false). – Ramesh Sep 27 '16 at 22:10
  • In general value of `nullable` attribute is defined by the semantics of the input (primitives / boxed primitives / Option) or function you apply. So _some action_ is important here. – zero323 Sep 28 '16 at 06:42

1 Answers1

3

There is no "clear" way to do this. You can use trick like here

Relevant code from that answer:

def setNullableStateOfColumn( df: DataFrame, cn: String, nullable: Boolean) : DataFrame = {

  // get schema
  val schema = df.schema
  // modify [[StructField] with name `cn`
  val newSchema = StructType(schema.map {
    case StructField( c, t, _, m) if c.equals(cn) => StructField( c, t, nullable = nullable, m)
    case y: StructField => y
  })
  // apply new schema
  df.sqlContext.createDataFrame( df.rdd, newSchema )
}

It would copy DataFrame and copy schema, but with specyfying nullable programatically

Version for many columns:

def setNullableStateOfColumn(df: DataFrame, nullValues: Map[String, Boolean]) : DataFrame = {

  // get schema
  val schema = df.schema
  // modify [[StructField]s with name `cn`
  val newSchema = StructType(schema.map {
    case StructField( c, t, _, m) if nullValues.contains(c) => StructField( c, t, nullable = nullValues.get(c), m)
    case y: StructField => y
  })
  // apply new schema
  df.sqlContext.createDataFrame( df.rdd, newSchema )
}

Usage: setNullableStateOfColumn(df1, Map ("col1" -> true, "col2" -> true, "col7" -> false));

Community
  • 1
  • 1
T. Gawęda
  • 15,706
  • 4
  • 46
  • 61
  • I think this way I cannot change the nullable value of only one particular column from the dataframe. – Ramesh Sep 28 '16 at 02:32
  • Technically, you're not changing, but creating new dataset with new schema. Look at line with `case StructField` - there is `if` column name is equal to column name from argument, then set nullability as given from parameter. If not, column data will be copied – T. Gawęda Sep 28 '16 at 07:35
  • So if I have 4 columns in a dataframe and i had to modify nullable value for only one column. val df2 = df1.withColumn("col1", udfname(df1, "col1", true)) and then i need to call that method "setNullableStateOfColumn" through udf function which i created ? – Ramesh Sep 28 '16 at 13:26
  • udf will be something like val udfname = udf(setNullableStateOfColumn _) please let me know. Thanks – Ramesh Sep 28 '16 at 13:27
  • No, just invoke `setNullableStateOfColumn (df1, "col1", false)`, not as UDF but normal function – T. Gawęda Sep 28 '16 at 13:28
  • @Ramesh it worked? Have you got any additional questions? – T. Gawęda Sep 29 '16 at 13:59
  • Hi Gaweda, Sorry i had to be away for some time. I am trying to solve this now. – Ramesh Sep 29 '16 at 21:40
  • @Gaweda, Its working for single column. How can i modify the setNullableStateOfColumn function to work on more than one column's nullable property? – Ramesh Sep 30 '16 at 03:39
  • I know that my question was actually for one column, but i also would like to know for more than one column. Can you please reply on the same. – Ramesh Sep 30 '16 at 03:42
  • @ Gaweda, How to invoke this new function which had Map datatype as a argument. I mean if I create a Map with name "colNullableValues". val colNullableValues = Map("Col1" -> "false, "Col2" -> "true", "col3" -> "true") setNullableStateOfColumn(df1, colNullableValues) – Ramesh Sep 30 '16 at 19:28
  • It is map string-> boolean, not string->string – T. Gawęda Sep 30 '16 at 19:43
  • val colNullableValues = Map("Col1" -> false, "Col2" -> true, "col3" -> true) – Ramesh Sep 30 '16 at 19:55
  • And now `setNullableStateOfColumn(df1, colNullableValues)`, just like it was in answer – T. Gawęda Sep 30 '16 at 20:05
  • @Gaweda, The code above which we were discussing is for direct columns. But if we have a array field in dataframe and which has 3 columns in it whose schema is like below. |-- arraycol: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- col1: string (nullable = true) | | |-- col2: string (nullable = false) | | |-- col3: string (nullable = true) – Ramesh Oct 02 '16 at 15:43
  • ` val colNullableValues = Map("Col1" -> false, "Col2" -> true, "col3" -> true) def setNullableStateOfmultiColumn(df: DataFrame, nullValues: Map[String, Boolean]) : DataFrame = { // get schema val schema = df.schema // modify [[StructField]s with name `cn` val newSchema = StructType(schema.map { case StructField( c, t, _, m) if colNullableValues.contains(c) => StructField( c, t, nullable = nullValues.get(c), m) case y: StructField => y }) // apply new schema df.sqlContext.createDataFrame( df.rdd, newSchema ) }` – Ramesh Oct 03 '16 at 13:28
  • @Gaweda I am not able use your code for multi column. I used the above code for defining it. I am getting an issue saying cannot resolve StructField with such signature and also nullValues.get(c) its saying type mismatch, it is expecting Boolean but found Option[Boolean]. Can you please help to resolve. – Ramesh Oct 03 '16 at 13:31