pyspark's flatMap in pandas

Question

Is there an operation in pandas that does the same as flatMap in pyspark?

flatMap example:

>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
[1, 1, 1, 2, 2, 3]

So far I can think of apply followed by itertools.chain, but I am wondering if there is a one-step solution.
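The apply-then-chain idea mentioned above can be sketched roughly as follows. This is only an illustration: it assumes the RDD from the example is replaced by a `pd.Series`, and the names `s`, `mapped` and `flat` are chosen here, not taken from the question.

```python
from itertools import chain

import pandas as pd

s = pd.Series([2, 3, 4])

# Step 1: apply maps every element to an iterable (here a range).
mapped = s.apply(lambda x: range(1, x))

# Step 2: chain.from_iterable concatenates those iterables into one flat sequence.
flat = pd.Series(list(chain.from_iterable(mapped)))

print(sorted(flat))  # [1, 1, 1, 2, 2, 3]
```

Note that the original index is lost in this two-step version, since `chain` only sees the values.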

If this is a pure pandas question then it would help to more fully explain what you are trying to do (for folks not familiar with flatMap, which may be a lot of people here!). Sample data, desired results, etc. — JohnE, Jun 26 '15 at 20:19

score 8 · Answer 1 · answered Dec 31 '15 at 00:27

There's a hack. I often do something like

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

In [3]: df['x'].apply(pd.Series).unstack().reset_index(drop=True)
Out[3]:
0     1
1     3
2     2
3     4
4   NaN
5     5
dtype: float64

The NaN is introduced because apply(pd.Series) pads the shorter lists so that the intermediate DataFrame is rectangular (unstack then turns it into a Series with a MultiIndex), but for a lot of things you can just drop it:

In [4]: df['x'].apply(pd.Series).unstack().reset_index(drop=True).dropna()
Out[4]:
0    1
1    3
2    2
3    4
5    5
dtype: float64

This trick uses only pandas code, so I would expect it to be reasonably efficient, though it may cope poorly with lists of very different sizes.
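To make the concern about differently sized lists concrete, here is a small sketch. The column contents are hypothetical (one list of length 1, one of length 1000); it only shows how wide the intermediate frame becomes.

```python
import pandas as pd

# Hypothetical data: one very short list and one long list.
df = pd.DataFrame({'x': [[1], list(range(1000))]})

# Each list becomes a row; shorter rows are padded with NaN.
wide = df['x'].apply(pd.Series)

print(wide.shape)                    # (2, 1000)
print(int(wide.isna().sum().sum()))  # 999
```

The intermediate frame is as wide as the longest list, so most of it can end up being padding.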

If you replace reset_index(drop=True) with reset_index(level=0, drop=True), this also maintains the old index — Gilthans, Feb 03 '21 at 14:05
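A minimal sketch of that variant, assuming string labels `'a'` and `'b'` (chosen here purely so the preserved index is easy to spot):

```python
import pandas as pd

df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]}, index=['a', 'b'])

flat = (
    df['x'].apply(pd.Series)
    .unstack()                        # MultiIndex: (list position, original label)
    .reset_index(level=0, drop=True)  # drop the list position, keep the original label
    .dropna()
)

print(flat.index.tolist())  # ['a', 'b', 'a', 'b', 'b']
print(flat.tolist())        # [1.0, 3.0, 2.0, 4.0, 5.0]
```

The values still come out in column-major order, so a stable sort such as `flat.sort_index(kind='mergesort')` may be wanted to group them by original label.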

score 1 · Answer 2 · edited Feb 16 '17 at 12:16

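There are three steps to solve this question.

import pandas as pd
df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
# 1. expand each list into columns, 2. unstack into one long Series,
# 3. move the MultiIndex into columns and drop the NaN padding
df_new = df['x'].apply(pd.Series).unstack().reset_index().dropna()
# 'level_1' is the original row label; column 0 holds the flattened values
df_new[['level_1', 0]]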

answered Feb 16 '17 at 11:52 by nikita · edited Feb 16 '17 at 12:16 by H. Pauwelyn

score 1 · Answer 3 · answered Jul 13 '21 at 07:17

Since July 2019 (version 0.25.0), pandas offers pd.Series.explode to unnest list-like values. Here's a possible implementation of pd.Series.flatmap based on explode and map. Why map?

flatmap operations should build on map, not apply. For details on map/applymap/apply, see the thread "Difference between map, applymap and apply methods in Pandas".

import pandas as pd
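from typing import Any, Callable, Iterable

def flatmap(
    self,
    func: Callable[[Any], Iterable],
    ignore_index: bool = False):
    # func receives one element at a time and must return a list-like;
    # the ignore_index argument of explode needs pandas >= 1.1
    return self.map(func).explode(ignore_index=ignore_index)
pd.Series.flatmap = flatmap  # monkey-patch so every Series gains .flatmap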

# example
df = pd.DataFrame([(x,y) for x,y in zip(range(1,6),range(6,16))], columns=['A','B'])
print(df.head(5))
#    A   B
# 0  1   6
# 1  2   7
# 2  3   8
# 3  4   9
# 4  5  10
print(df.A.flatmap(range,False))
# 0    0
# 1    0
# 1    1
# 2    0
# 2    1
# 2    2
# 3    0
# 3    1
# 3    2
# 3    3
# 4    0
# 4    1
# 4    2
# 4    3
# 4    4
# Name: A, dtype: object
print(df.A.flatmap(range,True))
# 0     0
# 1     0
# 2     1
# 3     0
# 4     1
# 5     2
# 6     0
# 7     1
# 8     2
# 9     3
# 10    0
# 11    1
# 12    2
# 13    3
# 14    4
# Name: A, dtype: object

As you can see, the main issue is the indexing. You could ignore it and just reset, but then you're better off using NumPy or standard lists, as indexing is one of pandas' key features. If you do not care about indexing at all, you could reuse the idea of the solution above for a whole DataFrame: change pd.Series.map to pd.DataFrame.applymap (renamed pd.DataFrame.map in pandas 2.1) and pd.Series.explode to pd.DataFrame.explode, and force ignore_index=True.
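A rough sketch of that DataFrame-level variant follows. The helper name `frame_flatmap` and the sample frame are hypothetical, and it assumes pandas 1.3 or later, where `DataFrame.explode` accepts several columns, provided the lists within each row have equal length.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

def frame_flatmap(frame, func):
    """Hypothetical helper: apply func element-wise, then explode every column."""
    # DataFrame.map exists from pandas 2.1; older versions call it applymap.
    elementwise = getattr(frame, 'map', None) or frame.applymap
    mapped = elementwise(func)
    # Multi-column explode needs equal list lengths within each row.
    return mapped.explode(list(mapped.columns), ignore_index=True)

# func returns a fixed-length list, so every row explodes consistently.
out = frame_flatmap(df, lambda v: [v, v * 10])
print(out)
```

Because `ignore_index=True` is forced, the result gets a fresh 0..n-1 index and no trace of the source rows remains.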

score -1 · Answer 4 · MRocklin · 2015-06-26T23:01:01.873

I suspect that the answer is "no, not efficiently."

Pandas isn't built for nested data like this. I suspect that the case you're considering in Pandas looks a bit like the following:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

In [3]: df
Out[3]: 
           x
0     [1, 2]
1  [3, 4, 5]

And that you want something like the following
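Presumably the desired result is a flat frame with one value per row. A minimal sketch of that target, assuming pandas 0.25 or later for `Series.explode` (which postdates this answer); the name `flat` is chosen here for illustration:

```python
import pandas as pd

df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

# One row per list element; reset_index discards the repeated source labels.
flat = df['x'].explode().reset_index(drop=True).to_frame()

print(flat['x'].tolist())  # [1, 2, 3, 4, 5]
```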

It is far more typical to normalize your data in Python before you send it to Pandas. If Pandas did do this then it would probably only be able to operate at slow Python speeds rather than fast C speeds.

Generally one does a bit of munging of data before one uses tabular computation.
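As a sketch of what normalizing in Python first might look like, assuming the nested data is available as a plain list of lists before it reaches pandas (the variable names are illustrative):

```python
from itertools import chain

import pandas as pd

# Hypothetical nested input, as it might arrive from JSON or a parser.
nested = [[1, 2], [3, 4, 5]]

# Flatten in plain Python first ...
flat_values = list(chain.from_iterable(nested))

# ... then hand pandas an already tabular column.
df = pd.DataFrame({'x': flat_values})

print(df['x'].tolist())  # [1, 2, 3, 4, 5]
```

Built this way, the column gets an integer dtype rather than the object dtype that a column of lists would have.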

answered Jun 26 '15 at 22:39 · edited Jun 26 '15 at 23:01 · MRocklin

Can you refer to a reading reference supporting the "Pandas isn't built for nested data like this." statement? Me and other Pandas beginners would like to know more! :) – Tarrasch Oct 10 '16 at 08:55
Have you read through the Pandas documentation: http://pandas.pydata.org/ ? It's pretty comprehensive. – MRocklin Oct 10 '16 at 13:04
Related, but contains more details: https://stackoverflow.com/questions/53218931/how-do-i-unnest-a-column-in-a-pandas-dataframe – BENY Dec 01 '18 at 02:20
`flatMap` is used where a function returns a collection of values that you want to concatenate. This has nothing to do with normalisation. – RobinGower Jan 21 '22 at 10:50
