7

I am writing a map method using

RDD.map(lambda line: my_method(line))

and based on a particular condition in my_method (let's say line starts with 'a'), I want to either return a particular value otherwise ignore this item all together.

For now, I am returning -1 if the condition is not met on the item and later using another

RDD.filter() method to remove all the ones with -1.

Any better way to be able to ignore these items by returning null from my_method?

London guy
  • 27,522
  • 44
  • 121
  • 179

3 Answers3

13

In case like this flatMap is your friend:

  1. Adjust my_method so it returns either a single element list or an empty list (or create a wrapper like here What is the equivalent to scala.util.Try in pyspark?)

    def my_method(line):
        return [line.lower()] if line.startswith("a") else []
    
  2. flatMap

    rdd = sc.parallelize(["aDSd", "CDd", "aCVED"])
    
    rdd.flatMap(lambda line: my_method(line)).collect()
    ## ['adsd', 'acved']
    
Community
  • 1
  • 1
zero323
  • 322,348
  • 103
  • 959
  • 935
2

If you want to ignore the items based on some condition, then why not use filter by itself? Why use a map? If you want to transform it, you can use map on the output from filter.

Dmitry Rubanovich
  • 2,471
  • 19
  • 27
-1

filter is transformation method. It is high-cost operation because of creating new RDD.

keeptalk
  • 35
  • 2