
I am new to Python/PySpark and I am having trouble cleansing the data before using it in my Mac's terminal. I want to delete any row that contains null values or repeated rows. I used .distinct() and also tried:

rw_data3 = rw_data.filter(rw_data.isNotNull())

I also tried...

from functools import reduce
rw_data.filter(~reduce(lambda x, y: x & y, [rw_data[c].isNull() for c in rw_data.columns])).show()

but I get

"AttributeError: 'RDD' object has no attribute 'isNotNull'"

or

"AttributeError: 'RDD' object has no attribute 'columns'"

This clearly shows I do not really understand the syntax for cleaning up the DataFrame.

  • Looks like you have an `rdd` and not a DataFrame. Try `print(type(rw_data3))` to find out for sure. – pault Sep 17 '18 at 21:24

1 Answer


It looks like you have an rdd, and not a DataFrame. You can easily convert the rdd to a DataFrame and then use pyspark.sql.DataFrame.dropna() and pyspark.sql.DataFrame.dropDuplicates() to "clean" it.

clean_df = rw_data3.toDF().dropna().dropDuplicates()

Both of these functions accept an optional parameter subset, which you can use to restrict the search for nulls and duplicates to a subset of the columns.


If you wanted to "clean" your data as an rdd, you can use filter() and distinct() as follows:

clean_rdd = rw_data2.filter(lambda row: all(x is not None for x in row)).distinct()
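The lambda passed to filter() is plain Python, so you can check its behavior without Spark; here is the same null-dropping and dedup logic applied to ordinary tuples (the sample rows are invented for illustration):

```python
# Same predicate as the rdd filter above: keep rows with no None fields
def row_is_complete(row):
    return all(x is not None for x in row)

rows = [(1, "a"), (2, None), (1, "a"), (3, "b")]

complete = [r for r in rows if row_is_complete(r)]
# distinct() analogue: dict.fromkeys dedupes while preserving order
deduped = list(dict.fromkeys(complete))
print(deduped)  # [(1, 'a'), (3, 'b')]
```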
  • Thanks, you are right. They are RDDs. Can I clean the RDDs as well? Or only the dataframes and then I have to convert them back into RDDs? Thanks again! – lauvdb Sep 17 '18 at 22:14
  • @lauvdb updated the answer for `rdd`s. You can operate on `rdd` but the DataFrame is [is generally preferred](https://stackoverflow.com/a/31508314/5858851), depending on [what you're trying to do](https://stackoverflow.com/a/44317177/5858851). – pault Sep 17 '18 at 22:23