I'm trying to apply a function to a DataFrame row-wise (axis=1). When the applied function returns a Series, the value returned by 'apply' is a DataFrame, which is not what I want. I found a similar problem here, Returning multiple values from pandas apply on a DataFrame, but that case is about applying a function to a groupby. In the non-groupby case, a DataFrame is returned even if the Series returned by the applied function have different lengths.

In [10]: import pandas as pd

In [11]: import numpy as np

In [12]: df = pd.DataFrame({'start': [1, 2, 3], 'end': [7, 9, 9]})

In [13]: df
Out[13]:
   end  start
0    7      1
1    9      2
2    9      3

In [14]: def fun(df):
    ...:     return pd.Series(np.arange(df['start'], df['end'], 1))
    ...:

In [15]: df.apply(fun, axis=1)
Out[15]:
     0    1    2    3    4    5    6
0  1.0  2.0  3.0  4.0  5.0  6.0  NaN
1  2.0  3.0  4.0  5.0  6.0  7.0  8.0
2  3.0  4.0  5.0  6.0  7.0  8.0  NaN

However, what I want is something like this (a hierarchically indexed Series):

Out[23]:
0  0    1.0
   1    2.0
   2    3.0
   3    4.0
   4    5.0
   5    6.0
1  0    2.0
   1    3.0
   2    4.0
   3    5.0
   4    6.0
   5    7.0
   6    8.0
2  0    3.0
   1    4.0
   2    5.0
   3    6.0
   4    7.0
   5    8.0
dtype: float64
Scott Boston
Woods Chen
  • Can you add some data sample? – jezrael Jun 13 '18 at 06:47
  • Welcome to StackOverflow. Please take the time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. – jezrael Jun 13 '18 at 06:48
  • OK, wait a moment, thank you. – Woods Chen Jun 13 '18 at 06:49
  • Please edit question :) – jezrael Jun 13 '18 at 06:49
  • I'm trying to edit; this is my first time posting a question. – Woods Chen Jun 13 '18 at 06:57
  • OK, but what do you need as expected output? – jezrael Jun 13 '18 at 07:00
  • Or do you need `def fun(df): return np.arange(df['start'], df['end'], 1)`? – jezrael Jun 13 '18 at 07:03
  • I need a stacked (hierarchically indexed) Series. OK, I need to edit the question again. – Woods Chen Jun 13 '18 at 07:13

1 Answer

Here apply converts the returned Series to a DataFrame by design, so one possible solution is to use stack:

s = df.apply(fun, axis=1).stack()
print (s)
0  0    1.0
   1    2.0
   2    3.0
   3    4.0
   4    5.0
   5    6.0
1  0    2.0
   1    3.0
   2    4.0
   3    5.0
   4    6.0
   5    7.0
   6    8.0
2  0    3.0
   1    4.0
   2    5.0
   3    6.0
   4    7.0
   5    8.0
dtype: float64

Or use a list comprehension with concat:

L = [pd.Series(np.arange(a, b)) for a, b in zip(df['start'], df['end'])]
s = pd.concat(L, keys=df.index)
print (s)
0  0    1
   1    2
   2    3
   3    4
   4    5
   5    6
1  0    2
   1    3
   2    4
   3    5
   4    6
   5    7
   6    8
2  0    3
   1    4
   2    5
   3    6
   4    7
   5    8
dtype: int32
jezrael
  • Thank you jezrael. The problem is that my data is very large: if 'apply' returns a DataFrame, it may have too many columns (the number of columns equals the length of the longest Series returned by the applied function) and exhaust my whole RAM. Likewise, I can't use 'for' because there are too many rows. – Woods Chen Jun 13 '18 at 07:33
  • @WoodsChen - Is it not possible to use the second solution? – jezrael Jun 13 '18 at 07:48
  • No, I've tried that too; it may exhaust the memory even faster, maybe due to the 'for' loop and zip operation. – Woods Chen Jun 13 '18 at 08:18
  • @WoodsChen - Hmm, and does it work if you use `return np.arange(df['start'], df['end'], 1)`? – jezrael Jun 13 '18 at 08:30
  • Hmm, I tried and it failed again. With Python 3.6.2 / IPython 6.2.1 / Windows, it raised an error due to mismatched columns. With Python 3.6.5 / IPython 6.4.0 / Ubuntu, it returned a Series whose values are all lists (the return value of np.arange), which is still not what I want. – Woods Chen Jun 13 '18 at 08:59
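
For the memory concern raised in these comments, one option that nobody in the thread proposed is to skip 'apply' and build the flat values and both index levels directly with NumPy. The sketch below is only an illustration under assumptions: it assumes integer 'start' and 'end' columns with a step of 1, as in the question, and the variable names (lengths, offsets, outer, inner) are made up for the example. It avoids the wide intermediate DataFrame and the per-row Series objects, but it has not been measured on large data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'start': [1, 2, 3], 'end': [7, 9, 9]})

# Number of values each row contributes (assumption: integer bounds, step 1).
lengths = (df['end'] - df['start']).clip(lower=0).to_numpy()

# Outer index level: each row label repeated once per value it produces.
outer = np.repeat(df.index.to_numpy(), lengths)

# Inner index level: position within each row's range (0, 1, 2, ...).
offsets = np.cumsum(lengths) - lengths          # where each row starts in the flat array
inner = np.arange(lengths.sum()) - np.repeat(offsets, lengths)

# Values: the row's start plus the position within the row.
values = np.repeat(df['start'].to_numpy(), lengths) + inner

s = pd.Series(values, index=pd.MultiIndex.from_arrays([outer, inner]))
print(s)
```

The result has the same two-level index as the stacked output above, but keeps an integer dtype because no NaN padding is ever created.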