When I loaded a fairly large dataset (e.g. Wikipedia's archives) into a Spark DataFrame, I received the error below:

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
    Caused by: java.lang.NullPointerException
    at org.apache.spark.ml.feature.Tokenizer$$anonfun$createTransformFunc$1.apply(Tokenizer.scala:39)
    at org.apache.spark.ml.feature.Tokenizer$$anonfun$createTransformFunc$1.apply(Tokenizer.scala:39)

What is the best way to remove null values from a PySpark DataFrame?

Johan Sulaiman

1 Answer

You can use na.drop() to remove all rows that contain null values:

df.na.drop()
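
Since the stack trace shows the NullPointerException coming from Tokenizer, it is likely a null in the text column being tokenized. Here is a minimal sketch of both options; the DataFrame and the column name "text" are assumptions for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("drop-nulls").getOrCreate()

    # Hypothetical example data; "text" stands in for whatever column
    # you feed to the Tokenizer.
    df = spark.createDataFrame(
        [(1, "spark is fast"), (2, None), (3, "nulls break Tokenizer")],
        ["id", "text"],
    )

    # Drop every row that has a null in ANY column (the default, how="any").
    cleaned = df.na.drop()

    # Or drop only rows where the tokenized column is null, keeping rows
    # that have nulls elsewhere.
    cleaned_text_only = df.na.drop(subset=["text"])

    cleaned_text_only.show()

Restricting the drop with subset is usually the safer choice on a wide dataset like Wikipedia's archives, since the default would also discard rows that are null in columns the Tokenizer never touches.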
Ali AzG