When performing a map in PySpark, I often want to drop records that fail the mapping function (in this example, parsing a line into XML). Is there a clean way to do this in the mapping step itself?
The obvious solution of returning nothing (None) still leaves an entry in the RDD for every failed line, e.g.:
### **** skip pyspark boilerplate ****
### function defs
from lxml import etree as ET

def return_valid_xml(one_line_input):
    try:
        root = ET.fromstring(one_line_input)
        return root
    except ET.XMLSyntaxError:
        return None

### code that returns stuff for every line of input
valid_xml_data = someDataStrings.map(lambda x: return_valid_xml(x))
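
With this approach every line that fails to parse comes back as None, so as far as I can tell a second pass over the RDD is needed just to strip the placeholders. A minimal sketch of that extra pass, assuming someDataStrings is an RDD of strings as above:

### every failed line shows up as None, so a second pass is needed just to drop the placeholders
non_null_xml = valid_xml_data.filter(lambda x: x is not None)
print(non_null_xml.count())  # only the lines that parsed successfully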
Coming up with a clever filter is a waste of my time, and a dumb filter, i.e. a try/except around ET.fromstring() that just returns True or False, is a waste of computational time, since it parses the XML twice.
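
For reference, the dumb filter I mean would look roughly like this (parses_as_xml is just a hypothetical helper name); every line that survives the filter gets handed to lxml a second time in the map:

### rough sketch of the "dumb filter": every surviving line is parsed twice
def parses_as_xml(one_line_input):
    try:
        ET.fromstring(one_line_input)  # first parse, result thrown away
        return True
    except ET.XMLSyntaxError:
        return False

valid_xml_data = someDataStrings.filter(parses_as_xml).map(return_valid_xml)  # second parse happens here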