
I have a PySpark DataFrame with 7 columns. I need to add a new column named "sum" that holds, for each row, the number of columns that contain data (are not null). Example: a DataFrame in which the yellow-highlighted part is the required answer.

1 Answer


This sum can be calculated like this:

df = spark.createDataFrame([
    (1, "a", "xxx", None, "abc", "xyz","fgh"), 
    (2, "b", None, 3, "abc", "xyz","fgh"),
    (3, "c", "a23", None, None, "xyz","fgh")
], ("ID","flag", "col1", "col2", "col3", "col4", "col5"))

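from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Note: `sum` here is Python's built-in sum, not pyspark.sql.functions.sum.
# Each column contributes 1 if its value is not null and 0 otherwise.
df2 = df.withColumn("sum", sum([(~F.isnull(df[col])).cast(IntegerType()) for col in df.columns]))
df2.show()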
+---+----+----+----+----+----+----+---+
| ID|flag|col1|col2|col3|col4|col5|sum|
+---+----+----+----+----+----+----+---+
|  1|   a| xxx|null| abc| xyz| fgh|  6|
|  2|   b|null|   3| abc| xyz| fgh|  6|
|  3|   c| a23|null|null| xyz| fgh|  5|
+---+----+----+----+----+----+----+---+

Hope this helps!

michalrudko
  • Thanks, but it's giving me the error "Column is not iterable". –  Jan 19 '20 at 15:05
  • Actually, in my case it's `F.sum(...)` that gives that error, so please check your imports; this should work. – michalrudko Jan 19 '20 at 16:09
  • I imported them like this: `from pyspark.sql import functions as F` and `from pyspark.sql.types import IntegerType`. Still, it is not working. –  Jan 19 '20 at 17:10
  • Again, this must have something to do with the imports. Maybe you imported something earlier that overrides some functions? Please remove all the preceding imports and try again. You can find another explanation of this issue here: https://stackoverflow.com/a/53868119/4113409 . There is nothing else I could do here... – michalrudko Jan 19 '20 at 17:51
  • You're aggregating across the columns in a row, so make sure you're using Python's built-in sum function, not the PySpark one (from pyspark.sql.functions): "You can delete the reference of the pyspark function with `del sum`." – michalrudko Jan 19 '20 at 17:54
  • Thank you very much. After deleting sum (`del sum`), it worked. –  Jan 19 '20 at 18:51
  • Glad to hear that! Please just kindly mark this as a correct answer if you're happy with the result. Thanks! – michalrudko Jan 19 '20 at 19:41
  • I already did, but it shows me "Votes cast by those with less than 15 reputation are recorded, but do not change the publicly displayed post score." –  Jan 20 '20 at 05:10
  • Ah, ok, thanks then :) I guess this will be visible as soon as you get 15 reputation. Good luck! – michalrudko Jan 20 '20 at 11:20
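
The "Column is not iterable" error in the comments came from the name `sum` pointing at the PySpark aggregate instead of Python's built-in. As a minimal sketch of one way to avoid depending on that name at all, the helper below calls `builtins.sum` explicitly. The function name `with_non_null_count` and the parameter `out_col` are hypothetical, introduced only for this illustration, and `df` is assumed to be a PySpark DataFrame like the one in the answer.

```python
import builtins


def with_non_null_count(df, out_col="sum"):
    """Return df with an extra column counting the non-null cells per row.

    `df` is assumed to be a PySpark DataFrame; `out_col` is a hypothetical
    parameter naming the new column.
    """
    # One expression per column: 1 where the cell holds a value, 0 where it is null.
    flags = [df[c].isNotNull().cast("int") for c in df.columns]
    # builtins.sum is always Python's own sum, even if the bare name `sum`
    # was rebound earlier (for example by `from pyspark.sql.functions import *`).
    return df.withColumn(out_col, builtins.sum(flags))
```

Calling `with_non_null_count(df).show()` on the DataFrame from the answer should give the same counts as the table there. Unlike `del sum`, this does not change what the name `sum` refers to in the rest of the session.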