
Let's say I have a dataframe with columns x and y. I'd like to automatically split it into arrays (or series) that have the same names as the columns, process the data, and then later rejoin them. It's pretty straightforward to do this manually:

x, y = df.x, df.y
z = x + y   # in actual use case, there are hundreds of lines like this
df = pd.concat([x,y,z],axis=1)

But I'd like to automate this. It's easy to get a list of strings with df.columns, but I really want [x,y] rather than ['x','y']. The best I can do so far is to work around that with exec:

import numpy as np
from pandas import DataFrame

df_orig = DataFrame({ 'x':range(1000), 'y':range(1000,2000),  'z':np.zeros(1000) })

def method1( df ):

   for col in df.columns:
      exec( col + ' = df.' + col + '.values')

   z = x + y   # in actual use case, there are hundreds of lines like this

   for col in df.columns:   
      exec( 'df.' + col + '=' + col )

df = df_orig.copy() 
method1( df )         # df appears to be view of global df, no need to return it
df1 = df

So there are 2 issues:

1) Using exec like this is generally considered a bad idea (and it has already caused me a problem when I tried to combine it with numba). Or is it not that bad? It seems to work fine for series and arrays.

2) I'm not sure of the best way to take advantage of views here. Ideally, all I really want is to use x as a view of df.x. I assume that is not possible when x is an array, but maybe it is if x is a series?

The example above is for arrays, but ideally I'm looking for a solution that also applies to series. Failing that, solutions that work with one or the other are of course welcome.

Motivation:

1) Readability, which can partially be achieved with eval, but I don't believe eval can be used over multiple lines.

2) With multiple lines like z=x+y, this method is a little faster with series (2x or 3x in examples I've tried) and even faster with arrays (over 10x). See here: Fastest way to numerically process 2d-array: dataframe vs series vs array vs numba

JohnE
  • I'm assuming your operations are too complex, but have you considered just using `df.eval`? I guess I'm having trouble seeing the goal other than shortening code just slightly? – chrisb Sep 17 '14 at 17:27
  • You can access the columns using the subscript operator, like so: `df[col]`. The col here can be a string, which is a lot less hassle than using `exec`. – EdChum Sep 17 '14 at 17:28
  • Note that 'z = x + y' is just a placeholder for what might be over a thousand such lines of code. For that many lines of code, it's worthwhile to me just to get rid of the extra df. or df[] around everything, but even more important is the significant increase in speed. – JohnE Sep 17 '14 at 18:26
  • You might show an example of where you get a speed increase? – chrisb Sep 17 '14 at 19:45
  • @chrisb benchmarks http://stackoverflow.com/questions/25915541/fastest-way-to-numerically-process-2d-array-dataframe-vs-series-vs-array-vs-num – JohnE Sep 18 '14 at 14:34

2 Answers

Just use indexing notation and a dictionary, instead of attribute notation.

import numpy as np
from pandas import DataFrame

df_orig = DataFrame({'x': range(1000), 'y': range(1000, 2000), 'z': np.zeros(1000)})

def method1(df):
    # Collect each column as a Series, keyed by column name
    series = {}
    for col in df.columns:
        series[col] = df[col]

    series['z'] = series['x'] + series['y']   # in actual use case, there are hundreds of lines like this

    # Write the processed Series back into the DataFrame
    for col in df.columns:
        df[col] = series[col]

df = df_orig.copy()
method1(df)         # method1 modifies df in place, so there is no need to return it
df1 = df
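
If the `series['x']` subscripts are too noisy across hundreds of lines, one possible variation (a minimal sketch, not something from the question or this answer) is to put the columns on a `types.SimpleNamespace`, so each one is reachable as a short attribute such as `c.x`. The helper names `split_columns` and `join_columns` are hypothetical, and the sketch assumes every column name is a string that is also a valid Python identifier.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd


def split_columns(df):
    """Return a namespace whose attributes are the columns as NumPy arrays.

    Hypothetical helper; assumes the column names are valid identifiers.
    """
    return SimpleNamespace(**{col: df[col].to_numpy() for col in df.columns})


def join_columns(ns, index=None):
    """Rebuild a DataFrame from every attribute stored on the namespace."""
    return pd.DataFrame(vars(ns), index=index)


df = pd.DataFrame({"x": np.arange(1000), "y": np.arange(1000, 2000)})

c = split_columns(df)
c.z = c.x + c.y   # placeholder for the many lines of array arithmetic
result = join_columns(c, index=df.index)
```

Swapping `df[col].to_numpy()` for `df[col]` should give the same pattern with series instead of arrays. No `exec` is involved, so the names stay visible to ordinary tooling.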
Mark Whitfield
  • Thanks Mark, I guess I could combine this dictionary solution with chrisb's context manager to get a pretty complete solution, although that's starting to get complicated... – JohnE Sep 17 '14 at 21:09

This doesn't do exactly what you want, but it is another path to think about.

There's a gist here that defines a context manager that allows you to reference columns as if they were locals. I didn't write this, and it's a little old, but still seems to work with the current version of pandas.

In [45]: df = pd.DataFrame({'x': np.random.randn(100000), 'y': np.random.randn(100000)})

In [46]: with DataFrameContextManager(df):
    ...:     z = x + y
    ...:     

In [47]: z.head()
Out[47]: 
0   -0.821079
1    0.035018
2    1.180576
3   -0.155916
4   -2.253515
dtype: float64
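
For readers who want a feel for how such a context manager could work, here is a minimal sketch. It is not the gist's implementation: the name `columns_as_names` and the explicit `namespace` argument are assumptions made for illustration. It temporarily copies each column into a namespace dict (for example `globals()`) and restores the previous contents on exit. It relies on global name lookup, so names that are local to a function cannot be injected this way.

```python
from contextlib import contextmanager

import pandas as pd


@contextmanager
def columns_as_names(df, namespace):
    """Temporarily expose each column of df as a name in `namespace`.

    Hypothetical helper for illustration only. Assumes the column names
    are strings that are valid Python identifiers.
    """
    # Remember any existing names that the columns are about to shadow
    saved = {col: namespace[col] for col in df.columns if col in namespace}
    namespace.update({col: df[col] for col in df.columns})
    try:
        yield
    finally:
        # Remove the injected names, then restore whatever they shadowed
        for col in df.columns:
            namespace.pop(col, None)
        namespace.update(saved)


df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]})

with columns_as_names(df, globals()):
    z = x + y   # x and y resolve to the columns only inside the block
```

Results assigned inside the block, such as `z`, are ordinary variables and remain available afterwards; only the injected column names are cleaned up.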
chrisb
  • Thanks, I'll take a look at that. I asked a related question here http://stackoverflow.com/questions/25856250/apply-a-mask-to-multiple-lines-syntactic-sugar and the answers also involved context managers. It certainly looks interesting, I'll have to explore and get a better sense of how context managers work. – JohnE Sep 17 '14 at 21:02