When I loaded a fairly large dataset (e.g. Wikipedia's archives) into a Spark DataFrame, I received the error below:

    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
    Caused by: java.lang.NullPointerException
    at org.apache.spark.ml.feature.Tokenizer$$anonfun$createTransformFunc$1.apply(Tokenizer.scala:39)
    at org.apache.spark.ml.feature.Tokenizer$$anonfun$createTransformFunc$1.apply(Tokenizer.scala:39)

What is the best way to remove null values from a PySpark DataFrame?

Johan Sulaiman

1 Answer

You can use na.drop() to remove all rows that contain null values:

df.na.drop()
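
Since the stack trace shows the NullPointerException coming from Tokenizer, it is likely a null in the text column being tokenized. Here is a minimal sketch of both options; the DataFrame and the column name "text" are assumptions for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("drop-nulls").getOrCreate()

    # Hypothetical example data; "text" stands in for whatever column
    # you feed to the Tokenizer.
    df = spark.createDataFrame(
        [(1, "spark is fast"), (2, None), (3, "nulls break Tokenizer")],
        ["id", "text"],
    )

    # Drop every row that has a null in ANY column (the default, how="any").
    cleaned = df.na.drop()

    # Or drop only rows where the tokenized column is null, keeping rows
    # that have nulls elsewhere.
    cleaned_text_only = df.na.drop(subset=["text"])

    cleaned_text_only.show()

Restricting the drop with subset is usually the safer choice on a wide dataset like Wikipedia's archives, since the default would also discard rows that are null in columns the Tokenizer never touches.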
Ali AzG