I am building an RDD from a text file. Some of the lines do not conform to the format I am expecting, in which case I use the marker -1.

def myParser(line):
    try:
        # do something
    except:
        return (-1, -1), -1

lines = sc.textFile('path_to_file')
pairs = lines.map(myParser)

Is it possible to remove the lines with the -1 marker? If not, what would be a workaround?

zero323
Bob

1 Answer

The cleanest solution I can think of is to discard malformed lines using a flatMap:

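def myParser(line):
    try:
        # do something
        return [result]  # where result is the value you want to return
    except Exception:  # a bare except would also catch KeyboardInterrupt and SystemExit
        return []  # an empty list means flatMap emits nothing for this line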

sc.textFile('path_to_file').flatMap(myParser)
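
To make the idea concrete, here is a minimal sketch that runs without a Spark cluster. The line format ('x,y,value' with integer fields) and the name parse_line are assumptions made for illustration, since the question does not say what the real parser does. itertools.chain.from_iterable stands in for flatMap locally: both flatten one level, so an empty list contributes no records.

```python
from itertools import chain


def parse_line(line):
    """Hypothetical parser: expects 'x,y,value' with integer fields."""
    try:
        x, y, value = line.split(",")
        # One-element list on success, so flattening yields exactly one record.
        return [((int(x), int(y)), int(value))]
    except ValueError:
        # Wrong field count or non-integer field: contribute nothing.
        return []


sample_lines = ["1,2,10", "garbage", "3,4,20", "5,6"]

# Local stand-in for rdd.flatMap(parse_line): map, then flatten one level.
parsed = list(chain.from_iterable(map(parse_line, sample_lines)))
print(parsed)
```

With an actual SparkContext, the same function should be usable as sc.textFile('path_to_file').flatMap(parse_line).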

See also "What is the equivalent to scala.util.Try in pyspark?"

You can also filter after the map:

pairs = lines.map(myParser).filter(lambda x: x != ((-1, -1), -1))
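
A small local sketch of the map-then-filter variant, again assuming a hypothetical 'x,y,value' line format and using Python's built-in map and filter in place of the RDD methods. It also shows one caveat of a marker value: a well-formed line that happens to parse to the marker is dropped too.

```python
MALFORMED = ((-1, -1), -1)  # same marker value as in the question


def parse_or_marker(line):
    """Hypothetical parser for 'x,y,value' lines; returns MALFORMED on bad input."""
    try:
        x, y, value = line.split(",")
        return (int(x), int(y)), int(value)
    except ValueError:
        return MALFORMED


sample_lines = ["1,2,10", "oops", "-1,-1,-1"]

# Local stand-in for lines.map(parse_or_marker).filter(lambda rec: rec != MALFORMED)
kept = list(filter(lambda rec: rec != MALFORMED, map(parse_or_marker, sample_lines)))
print(kept)
```

Here the valid line "-1,-1,-1" is removed along with the malformed one, because it is indistinguishable from the marker. If such values can occur in the data, the flatMap approach avoids the collision.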
Community
zero323