When performing a map in PySpark, I often want to drop records that fail the mapping function (in this example, parsing a line into XML). Is there a clean way to do this in the mapping step itself?
The obvious solution of returning nothing (None) still leaves an entry in the RDD for every failed line, e.g.:
### **** skip pyspark boilerplate ****
### function defs
from lxml import etree as ET

def return_valid_xml(one_line_input):
    try:
        root = ET.fromstring(one_line_input)
        return root
    except ET.XMLSyntaxError:
        return None

### code that returns stuff for every line of input
valid_xml_data = someDataStrings.map(lambda x: return_valid_xml(x))
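
With this approach every line that fails to parse comes back as None, so as far as I can tell a second pass over the RDD is needed just to strip the placeholders. A minimal sketch of that extra pass, assuming someDataStrings is an RDD of strings as above:

### every failed line shows up as None, so a second pass is needed just to drop the placeholders
non_null_xml = valid_xml_data.filter(lambda x: x is not None)
print(non_null_xml.count())  # only the lines that parsed successfully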
Coming up with a clever filter is a waste of my time, and a dumb filter, i.e. a try/except around ET.fromstring() that just returns True or False, is a waste of computational time, since it parses the XML twice.
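
For reference, the dumb filter I mean would look roughly like this (parses_as_xml is just a hypothetical helper name); every line that survives the filter gets handed to lxml a second time in the map:

### rough sketch of the "dumb filter": every surviving line is parsed twice
def parses_as_xml(one_line_input):
    try:
        ET.fromstring(one_line_input)  # first parse, result thrown away
        return True
    except ET.XMLSyntaxError:
        return False

valid_xml_data = someDataStrings.filter(parses_as_xml).map(return_valid_xml)  # second parse happens here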