I have a data frame in PySpark like below.

df.show()
+---+----+
| id|test|
+---+----+
| 1| Y|
| 1| N|
| 2| Y|
| 3| N|
+---+----+
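For reference, the frame can be built with something like this (a minimal sketch; spark is assumed to be an existing SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data matching the frame above
df = spark.createDataFrame(
    [(1, "Y"), (1, "N"), (2, "Y"), (3, "N")],
    ["id", "test"],
)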
I want to delete a record when the id is duplicated and its test value is N, keeping the row where test is Y.
After that, when I query the new_df it should look like below.

new_df.show()
+---+----+
| id|test|
+---+----+
| 1| Y|
| 2| Y|
| 3| N|
+---+----+
I am unable to figure out how to do this. I have done a groupBy on id with a count, but that gives only the id column and the count, dropping the test column. This is what I have done:

grouped_df = df.groupBy("id").count()

How can I achieve my desired result?
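One idea, since "Y" sorts after "N" lexicographically, is to take the max of test per id; a minimal sketch, assuming the aggregate comes from pyspark.sql.functions rather than the Python builtin:

from pyspark.sql import functions as F

# "Y" > "N" lexicographically, so max("test") keeps the Y row when an id has both
new_df = df.groupBy("id").agg(F.max("test").alias("test"))

This is fine here because the frame only has id and test; my real data (see the edit below) has more columns.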
Edit:

I have another data frame like below.
+-------------+--------------------+--------------------+
| sn| device| attribute|
+-------------+--------------------+--------------------+
|4MY16A5602E0A| Android Phone| N|
|4MY16A5W02DE8| Android Phone| N|
|4MY16A5W02DE8| Android Phone| Y|
|4VT1735J00337| TV| N|
|4VT1735J00337| TV| Y|
|4VT47B52003EE| Router| N|
|4VT47C5N00A10| Other| N|
+-------------+--------------------+--------------------+
When I do the following:

new_df = df.groupBy("sn").agg(max("attribute").alias("attribute"))

I am getting a "'str' object has no attribute 'alias'" error (a guess at the cause and a fix are sketched after the expected result below).
The expected result should look like below.
+-------------+--------------------+--------------------+
| sn| device| attribute|
+-------------+--------------------+--------------------+
|4MY16A5602E0A| Android Phone| N|
|4MY16A5W02DE8| Android Phone| Y|
|4VT1735J00337| TV| Y|
|4VT47B52003EE| Router| N|
|4VT47C5N00A10| Other| N|
+-------------+--------------------+--------------------+
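From what I can tell, max here resolves to Python's builtin rather than the Spark aggregate, so it returns a plain str that has no .alias. A minimal sketch of what I think should work, importing the Spark functions explicitly and using a window so the device column is kept (the names w and rn are my own):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# order each sn group with the "Y" row first ("Y" > "N" lexicographically)
w = Window.partitionBy("sn").orderBy(F.col("attribute").desc())

new_df = (
    df.withColumn("rn", F.row_number().over(w))  # rank rows within each sn
      .filter(F.col("rn") == 1)                  # keep the top-ranked row per sn
      .drop("rn")
)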