
When running a map in PySpark, I often want to drop records that fail the mapping function (in this example, converting to XML). Is there a clean way to do this within the mapping step itself?

The obvious solution of returning nothing still leaves an object (`None`) in the RDD, e.g.:

### **** skip pyspark boilerplate ****

### function defs
from lxml import etree as ET
def return_valid_xml(one_line_input):
    try:
        root = ET.fromstring(one_line_input)
        return root
    except Exception:
        return None  # map still emits this None as an element of the RDD

### code that returns stuff for every line of input
valid_xml_data = someDataStrings.map(lambda x: return_valid_xml(x))

Coming up with a clever filter is a waste of my time, and a naive filter (say, a try/except around ET.fromstring() that returns True on success) is a waste of computation, because the XML gets parsed twice.

pault
Mark_Anderson
  • Possible duplicate of [What is the equivalent to scala.util.Try in pyspark?](https://stackoverflow.com/questions/33383275/what-is-the-equivalent-to-scala-util-try-in-pyspark) – 10465355 Oct 23 '18 at 23:23
  • You could try `flatMap`: return `[root]` on success and an empty list (`[]`) on failure. – pault Oct 24 '18 at 01:17
  • Totally works (want to make it an answer?). Why do you need to return a list, though? With `map`, returning `root` works, but `flatMap` needs `[root]`. Very strange. – Mark_Anderson Oct 24 '18 at 15:55

1 Answer


You could use flatMap and return an empty list on failure:

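from lxml import etree as ET

def return_valid_xml(one_line_input):
    try:
        root = ET.fromstring(one_line_input)
        return [root]  # one-element list: flatMap emits root
    except Exception:
        return []  # empty list: flatMap emits nothing for this line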

valid_xml_data = someDataStrings.flatMap(return_valid_xml)

You can also pass return_valid_xml directly instead of wrapping it in a lambda.
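To check the behaviour without a Spark cluster, here is a minimal local sketch of the same pattern. It rests on a few assumptions that are not part of the answer: the standard library's `xml.etree.ElementTree` stands in for lxml, only `ET.ParseError` is caught, and `local_flat_map` is a hypothetical helper that imitates the apply-then-flatten step of `flatMap` with `itertools.chain`.

```python
from itertools import chain
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml (assumption)


def parse_or_empty(line):
    """Return [root] when the line parses as XML, otherwise []."""
    try:
        return [ET.fromstring(line)]
    except ET.ParseError:
        # Malformed input contributes nothing to the flattened result.
        return []


def local_flat_map(func, items):
    """Hypothetical local imitation of flatMap: apply func, then flatten."""
    return list(chain.from_iterable(func(item) for item in items))


lines = ["<a>1</a>", "not xml", "<b><c/></b>", "<unclosed>"]
roots = local_flat_map(parse_or_empty, lines)
tags = [root.tag for root in roots]
print(tags)
```

Only the two well-formed lines survive, and each line is parsed exactly once. With a real RDD the equivalent call would be `someDataStrings.flatMap(parse_or_empty)`.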

pault
  • Same question as before: why does `map` successfully return `root`, while `flatMap` needs `[root]`? :) – Mark_Anderson Oct 27 '18 at 16:42
  • `flatMap` requires an iterable to be returned so that it can be flattened. See more [here](https://stackoverflow.com/questions/22350722/what-is-the-difference-between-map-and-flatmap-and-a-good-use-case-for-each) and [here](https://stackoverflow.com/questions/42997900/apache-spark-comparison-of-map-vs-flatmap-vs-mappartitions-vs-mappartitionswith) – pault Oct 27 '18 at 17:06
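To make the `root` versus `[root]` point concrete, the sketch below imitates the flattening step locally. It is an illustration under assumptions, not Spark itself: `flatten` is a hypothetical helper built on `itertools.chain`, and the standard library's `xml.etree.ElementTree` is used in place of lxml. An XML element is itself iterable (over its children), so flattening a bare `root` would likely yield its child elements rather than `root`, while `None` cannot be flattened at all.

```python
from itertools import chain
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml (assumption)


def flatten(results):
    """Hypothetical imitation of the flattening step that flatMap performs."""
    return list(chain.from_iterable(results))


root = ET.fromstring("<a><b/><c/></a>")

# Wrapped in a list: flattening yields the root element itself.
wrapped_tags = [elem.tag for elem in flatten([[root]])]

# Not wrapped: the element is iterated, so its children come out instead.
unwrapped_tags = [elem.tag for elem in flatten([root])]

# None is not iterable, so a bare `return` fails at the flattening step.
try:
    flatten([None])
    none_flattens = True
except TypeError:
    none_flattens = False

print(wrapped_tags, unwrapped_tags, none_flattens)
```

This is why the function returns a one-element list on success and an empty list on failure: both are iterables that flatten to exactly the records you want to keep.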