11

Is there an operation in pandas that does the same as flatMap in pyspark?

flatMap example:

>>> rdd = sc.parallelize([2, 3, 4])
>>> sorted(rdd.flatMap(lambda x: range(1, x)).collect())
[1, 1, 1, 2, 2, 3]

So far I can think of apply followed by itertools.chain, but I am wondering if there is a one-step solution.
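For reference, the two-step approach mentioned above might look like the sketch below. The Series `s` and the names `mapped` and `flat` are illustrative assumptions, standing in for the RDD in the PySpark example.

```python
from itertools import chain

import pandas as pd

# Stand-in for the RDD [2, 3, 4] from the PySpark example (assumed data).
s = pd.Series([2, 3, 4])

# "map" step: each element becomes an iterable.
mapped = s.apply(lambda x: range(1, x))

# "flatten" step: chain the iterables into one flat Series.
flat = pd.Series(list(chain.from_iterable(mapped)))

print(sorted(flat))  # [1, 1, 1, 2, 2, 3]
```

Note that the flattened Series gets a fresh 0..n-1 index, so the link back to the original rows is lost.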

JohnE
GeauxEric
  • If this is a pure pandas question then it would help to more fully explain what you are trying to do (for folks not familiar with flatMap, which may be a lot of people here!). Sample data, desired results, etc. – JohnE Jun 26 '15 at 20:19

4 Answers

8

There's a hack. I often do something like

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

In [3]: df['x'].apply(pd.Series).unstack().reset_index(drop=True)
Out[3]:
0     1
1     3
2     2
3     4
4   NaN
5     5
dtype: float64

The NaN appears because apply(pd.Series) builds a rectangular intermediate DataFrame and pads the shorter lists with NaN (which also turns the values into floats), but for a lot of things you can just drop it:

In [4]: df['x'].apply(pd.Series).unstack().reset_index(drop=True).dropna()
Out[4]:
0    1
1    3
2    2
3    4
5    5
dtype: float64

This trick uses only pandas code, so I would expect it to be reasonably efficient, though it may struggle when the lists differ greatly in size, since every row is padded to the length of the longest list.
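To make the padding cost concrete, here is a small sketch with two lists of very different lengths. The data is made up, and the padded frame is built with `pd.DataFrame(s.tolist())`, which is a substitution for `apply(pd.Series)` chosen here for illustration; both produce a rectangular, NaN-padded frame.

```python
import pandas as pd

# Two lists of very different lengths (assumed example data).
s = pd.Series([[1], list(range(1000))])

# Rectangular intermediate frame: one column per position in the longest list.
padded = pd.DataFrame(s.tolist(), index=s.index)

print(padded.shape)                    # (2, 1000)
print(int(padded.isna().sum().sum()))  # 999 cells are pure padding
```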

santon
  • If you replace reset_index(drop=True) with reset_index(level=0, drop=True), this also maintains the old index – Gilthans Feb 03 '21 at 14:05
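A minimal sketch of the index-preserving variant suggested in the comment above, assuming the same `df` as in the answer. The padded frame is built with `pd.DataFrame(df['x'].tolist(), index=df.index)` instead of `apply(pd.Series)`; that substitution is made here for illustration and is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

# Rectangular, NaN-padded frame that keeps the original row labels.
padded = pd.DataFrame(df['x'].tolist(), index=df.index)

# Drop only the outer (list-position) level, so the original row label survives.
flat = padded.unstack().reset_index(level=0, drop=True).dropna()

print(flat.index.tolist())  # [0, 1, 0, 1, 1]
print(flat.tolist())        # [1.0, 3.0, 2.0, 4.0, 5.0]
```

The values come out in column-major order (first elements of every list, then second elements, and so on), but each one still carries the label of the row it came from.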
1

There are three steps to solve this problem.

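import pandas as pd
df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})
# 1) expand lists into columns (NaN-padded), 2) unstack to long form, 3) drop the padding
df_new = df['x'].apply(pd.Series).unstack().reset_index().dropna()
# 'level_1' is the original row label; 0 holds the flattened values
df_new[['level_1', 0]]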

[Image: result picture]

H. Pauwelyn
nikita
1

Since July 2019 (pandas 0.25), pandas has offered pd.Series.explode to unnest list-like values. Here's a possible implementation of pd.Series.flatmap based on explode and map.

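import pandas as pd
from typing import Any, Callable, Iterable

def flatmap(
    self: pd.Series,
    func: Callable[[Any], Iterable[Any]],
    ignore_index: bool = False,
) -> pd.Series:
    # map applies func to each element; explode then gives every item of each result its own row
    return self.map(func).explode(ignore_index=ignore_index)

# monkey-patch the method onto pd.Series (the ignore_index argument needs pandas >= 1.1)
pd.Series.flatmap = flatmap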

# example
df = pd.DataFrame([(x,y) for x,y in zip(range(1,6),range(6,16))], columns=['A','B'])
print(df.head(5))
#    A   B
# 0  1   6
# 1  2   7
# 2  3   8
# 3  4   9
# 4  5  10
print(df.A.flatmap(range,False))
# 0    0
# 1    0
# 1    1
# 2    0
# 2    1
# 2    2
# 3    0
# 3    1
# 3    2
# 3    3
# 4    0
# 4    1
# 4    2
# 4    3
# 4    4
# Name: A, dtype: object
print(df.A.flatmap(range,True))
# 0     0
# 1     0
# 2     1
# 3     0
# 4     1
# 5     2
# 6     0
# 7     1
# 8     2
# 9     3
# 10    0
# 11    1
# 12    2
# 13    3
# 14    4
# Name: A, dtype: object

As you can see, the main issue is the indexing. You could ignore it and just reset it, but then you're better off using NumPy or standard Python lists, as indexing is one of pandas' key features. If you do not care about indexing at all, you could reuse the idea of the solution above: change pd.Series.map to pd.DataFrame.applymap (renamed pd.DataFrame.map in pandas 2.1), change pd.Series.explode to pd.DataFrame.explode, and force ignore_index=True.
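One way to read the DataFrame suggestion above, as a rough sketch: map a single column and explode it with `ignore_index=True`, so the values in the other columns are repeated for each produced element. The small frame and the name `out` are assumptions for illustration, and `ignore_index` requires pandas 1.1 or later.

```python
import pandas as pd

# Small assumed frame; column A drives the flatMap, column B is carried along.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [6, 7, 8]})

# Map A to an iterable per row, then give each produced element its own row.
out = df.assign(A=df['A'].map(range)).explode('A', ignore_index=True)

print(out['A'].tolist())  # [0, 0, 1, 0, 1, 2]
print(out['B'].tolist())  # [6, 7, 7, 8, 8, 8]
```

Exploding one column at a time sidesteps the requirement that, when several columns are exploded together, their list-likes have matching lengths in every row.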

Théophile Pace
-1

I suspect that the answer is "no, not efficiently."

Pandas isn't built for nested data like this. I suspect that the case you're considering in Pandas looks a bit like the following:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': [[1, 2], [3, 4, 5]]})

In [3]: df
Out[3]: 
           x
0     [1, 2]
1  [3, 4, 5]

And that you want something like the following

    x
0   1
0   2
1   3
1   4
1   5

It is far more typical to normalize your data in Python before you send it to Pandas. If Pandas did do this then it would probably only be able to operate at slow Python speeds rather than fast C speeds.

Generally one does a bit of munging of data before one uses tabular computation.
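As a sketch of what normalizing in plain Python first could look like for the example above (the names `nested`, `rows`, and `flat_df` and the `row` index label are assumptions for illustration):

```python
import pandas as pd

# Nested input, as in the example above.
nested = [[1, 2], [3, 4, 5]]

# Flatten in plain Python, remembering which original row each value came from.
rows = [(i, value) for i, values in enumerate(nested) for value in values]

# Hand pandas an already-tabular structure.
flat_df = pd.DataFrame(rows, columns=['row', 'x']).set_index('row')

print(flat_df.index.tolist())  # [0, 0, 1, 1, 1]
print(flat_df['x'].tolist())   # [1, 2, 3, 4, 5]
```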

MRocklin
  • Can you refer to a reading reference supporting the "Pandas isn't built for nested data like this." statement? Me and other Pandas beginners would like to know more! :) – Tarrasch Oct 10 '16 at 08:55
  • Have you read through the Pandas documentation: http://pandas.pydata.org/ ? It's pretty comprehensive. – MRocklin Oct 10 '16 at 13:04
  • Related, but with more details: https://stackoverflow.com/questions/53218931/how-do-i-unnest-a-column-in-a-pandas-dataframe – BENY Dec 01 '18 at 02:20
  • `flatMap` is used where a function returns a collection of values that you want to concatenate. This has nothing to do with normalisation. – RobinGower Jan 21 '22 at 10:50