
I was searching for a function to flatten an array of lists. First, I implemented my solution using the Apache Spark function flatMap on RDDs, but I would like to do this locally. However, I can't find the equivalent of

samples = filtered_tiles.flatMap(lambda tile: process_tile(tile, sample_size, grayscale))

in Python 3. Is there any workaround?

The array format is:

samples = [(slide_num, sample)]

Kind regards

    Possible duplicate of [How to make a flat list out of list of lists?](https://stackoverflow.com/questions/952914/how-to-make-a-flat-list-out-of-list-of-lists) – Matt Messersmith Nov 27 '18 at 23:41
  • Not quite. That answer only flattens an existing list of lists. I want to apply the map function and then flatten the result. – vftw Nov 28 '18 at 00:57
  • Then do that. Call `map` with a lambda that you pass in, then flatten with the linked answer, and put it all in a function called `flatMap`. Calling `map` is trivial; it's the flattening that's the issue, and that already has a solution. – Matt Messersmith Nov 28 '18 at 01:13

1 Answer


Here's an example of PySpark's flatMap on an RDD:

sc.parallelize([3,4,5]).flatMap(lambda x: range(1,x)).collect()

which will yield

[1, 2, 1, 2, 3, 1, 2, 3, 4]

as opposed to just map, which would yield [[1, 2], [1, 2, 3], [1, 2, 3, 4]].
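For comparison, here's what plain map gives. A quick sketch, assuming the same sc SparkContext as above (note that in Python 3 you need list(range(...)) to make the nesting visible, since range is lazy):

sc.parallelize([3, 4, 5]).map(lambda x: list(range(1, x))).collect()
# [[1, 2], [1, 2, 3], [1, 2, 3, 4]]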

flatMap also only does one level of "unnesting". In other words, if you have a 3d list, it will only flatten it to a 2d list. So, we'll make our flattener do this too.
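To see what "one level" means, here's a minimal pure-Python illustration (the nested list is made up for demonstration):

three_d = [[[1, 2], [3]], [[4], [5, 6]]]
# one level of unnesting: 3-d becomes 2-d, the inner lists stay intact
one_level = [inner for outer in three_d for inner in outer]
print(one_level)  # [[1, 2], [3], [4], [5, 6]]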

As alluded to in the comments, all you have to do is call the built-in map, write a flattening function, and chain the two together. Here's how:

def flatMap(f, li):
    # apply f to each element, then unnest the results one level
    mapped = map(f, li)
    flattened = flatten_single_dim(mapped)
    yield from flattened

def flatten_single_dim(mapped):
    # generator that strips exactly one level of nesting
    for item in mapped:
        for subitem in item:
            yield subitem
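If you'd rather not hand-roll the flattening, the standard library already covers it: itertools.chain.from_iterable does the same single-level unnesting. An equivalent sketch (same behavior as the generator version above) would be:

from itertools import chain

def flatMap(f, li):
    # chain.from_iterable lazily unnests exactly one level
    return chain.from_iterable(map(f, li))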

Going back to our example as a quick sanity check:

res = flatMap(lambda x: range(1, x), [3,4,5])
print(list(res))

which outputs:

[1, 2, 1, 2, 3, 1, 2, 3, 4]

as desired. For your case, you'd call flatMap(lambda tile: process_tile(tile, sample_size, grayscale), filtered_tiles) (given that filtered_tiles is an iterable), as in the sketch below.
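Applied to your data format, that might look like the following. The stub process_tile here is purely hypothetical (your real one takes sample_size and grayscale and does actual work); it just returns (slide_num, sample) pairs to match the format in the question:

# hypothetical stub standing in for the real process_tile
def process_tile(tile, sample_size, grayscale):
    slide_num, data = tile
    # grayscale is ignored in this placeholder
    return [(slide_num, sample) for sample in data[:sample_size]]

filtered_tiles = [(1, ["a", "b", "c"]), (2, ["d", "e"])]
# uses the flatMap defined above
samples = list(flatMap(lambda tile: process_tile(tile, 2, True), filtered_tiles))
print(samples)  # [(1, 'a'), (1, 'b'), (2, 'd'), (2, 'e')]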

P.S. As a side note, you can run Spark in "local" mode and just call flatMap on RDDs. It'll work just fine for prototyping small stuff on your local machine. Then you can hook into a cluster with some cluster manager when you're ready to scale and have TBs of data you need to rip through.
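For reference, a minimal local-mode setup might look like this sketch (the app name is arbitrary):

from pyspark.sql import SparkSession

# "local[*]" runs Spark in-process, using all available cores
spark = SparkSession.builder.master("local[*]").appName("prototype").getOrCreate()
sc = spark.sparkContext
print(sc.parallelize([3, 4, 5]).flatMap(lambda x: range(1, x)).collect())
# [1, 2, 1, 2, 3, 1, 2, 3, 4]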

HTH.

  • Do you get three cores working when doing this? I have been unable to make Spark use more than one core when the initial `parallelize([3,4,5])` is so short. I've tried adding the second parameter to `parallelize` as well without luck, including changing the Spark settings etc. – Thomas Ahle Jan 15 '20 at 14:32
  • @ThomasAhle Are you talking about running spark in local mode and getting it to invoke multiple cores? I'm not all that familiar with local mode, but when you're in cluster mode it's sort of complicated as to how the engine decides to distribute the work. Long story short, I don't think you can "force" it to use 3 cores. All you can say is "here, take these 3 cores". Maybe try it with more data, and it'll be more likely to "want" to consume more resources. With very small data (like above), you'd likely take a performance _hit_ doing it multithreaded/parallel anyway. – Matt Messersmith Jan 17 '20 at 15:20