I have a pyspark.rdd.PipelinedRDD called myRDD. Here is a sample of its content:
[((111, u'BB', u'A'), (444, u'BB', u'A')),
((222, u'BB', u'A'), (888, u'BB', u'A')),
((333, u'BB', u'B'), (999, u'BB', u'A')),...]
I need to remove every entry in which the third field of the first tuple differs from the third field of the second tuple. The expected result is:
[((111, u'BB', u'A'), (444, u'BB', u'A')),
((222, u'BB', u'A'), (888, u'BB', u'A')),...]
How can I do it?
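One possible approach, sketched under the assumption that every entry is a pair of 3-tuples exactly as in the sample: write a predicate that compares the third fields and pass it to the RDD's filter method. The name same_third_field and the local list rows are hypothetical and used only for illustration; the predicate is checked here on a plain Python list so that it runs without a Spark context.

```python
# Sample rows copied from the question; each row is a pair of 3-tuples.
rows = [
    ((111, u'BB', u'A'), (444, u'BB', u'A')),
    ((222, u'BB', u'A'), (888, u'BB', u'A')),
    ((333, u'BB', u'B'), (999, u'BB', u'A')),
]


def same_third_field(pair):
    """Return True when both tuples in the pair have the same third field."""
    left, right = pair
    return left[2] == right[2]


# Plain-Python check of the predicate (no Spark needed).
kept = [row for row in rows if same_third_field(row)]

# With Spark, the same predicate could be passed to RDD.filter,
# assuming myRDD holds rows of the shape shown above:
# filteredRDD = myRDD.filter(same_third_field)
```

Since filter is a lazy transformation that returns a new RDD, myRDD itself would stay unchanged and the result would only be computed once an action such as collect() is called.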