
I have a Spark DataFrame in PySpark that I am trying to remove the nulls from.

Earlier, while cleaning things up during parsing, I ran a `convert_to_null` method on the title column that checks whether a string is literally "None" and, if so, converts it to an actual Python None. That way, Spark stores it as an internal null.
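For reference, here is a minimal sketch of what that helper might look like. The actual `convert_to_null` isn't shown in the question, so the name and logic below are assumptions:

```python
def convert_to_null(value):
    """Map the literal string "None" to a real None so Spark stores a null.

    Guarding against values that are *already* None matters: running string
    operations on a None inside a UDF is exactly the kind of thing that
    raises a TypeError on the executors later.
    """
    if value is None or value == "None":
        return None
    return value

# Hypothetical usage during parsing, wrapped as a Spark UDF:
# from pyspark.sql import functions as F
# from pyspark.sql.types import StringType
# df = df.withColumn('title', F.udf(convert_to_null, StringType())(F.col('title')))
```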

Now I'm trying to drop the rows that have that null in the title column. Here is everything I've tried to remove my nulls:

```python
new_df = df.na.drop(subset=['title'])

new_df = df[F.col('title').isNotNull()]

new_df = df[~F.col('title').isNull()]
```

(Note that `na.drop` takes the column list via the `subset` keyword; its first positional argument is `how`.)

But I always get this error at the `new_df.show()` call a few lines later:

```
Py4JJavaError: An error occurred while calling o2022.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 87.0 failed 1 times, most recent failure: Lost task 1.0 in stage 87.0 (TID 314, localhost, executor driver): org.apache.spark.api.python.PythonException:
Traceback (most recent call last):
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 230, in main
    process()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 225, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 324, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 139, in dump_stream
    for obj in iterator:
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 313, in _batched
    for item in iterator:
  File "<string>", line 1, in <lambda>
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 75, in <lambda>
    return lambda *a: f(*a)
  File "/usr/local/spark/python/pyspark/util.py", line 55, in wrapper
    return f(*args, **kwargs)
  File "<ipython-input-16-48bc3ec1b5d9>", line 5, in replace_none_with_null
TypeError: 'in <string>' requires string as left operand, not NoneType
```
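For what it's worth, the last frame of that trace points at a `replace_none_with_null` function rather than at the filter itself: somewhere, a None ends up as the left operand of `in` against a string. The exact message is reproducible in plain Python:

```python
# Reproducing the TypeError from the trace: None used as the left
# operand of `in` when the right operand is a string.
try:
    None in "None"
except TypeError as exc:
    print(exc)  # 'in <string>' requires string as left operand, not NoneType
```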

I think I'm going crazy. I have no clue how to fix things. Any help is appreciated. Thanks!

JoeVictor
  • Your code fails in Python code (e.g. an `RDD` operation or a `udf`), not here. Please post a [mcve] ([How to make good reproducible Apache Spark Dataframe examples](https://stackoverflow.com/q/48427185/6910411)) – zero323 Oct 07 '18 at 10:12

0 Answers