Let's say I have a dataframe with columns x and y. I'd like to automatically split it into arrays (or series) that have the same names as the columns, process the data, and then later rejoin them. It's pretty straightforward to do this manually:
x, y = df.x, df.y
z = x + y # in actual use case, there are hundreds of lines like this
df = pd.concat([x,y,z],axis=1)
But I'd like to automate this. It's easy to get a list of strings with df.columns, but I really want [x,y] rather than ['x','y']. The best I can do so far is to work around that with exec:
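import numpy as np
from pandas import DataFrame

df_orig = DataFrame({ 'x':range(1000), 'y':range(1000,2000), 'z':np.zeros(1000) })

def method1( df ):
    # create local names x, y, z bound to each column's underlying array
    # (relies on Python 2 exec semantics; in Python 3, exec cannot create function locals)
    for col in df.columns:
        exec( col + ' = df.' + col + '.values')
    z = x + y # in actual use case, there are hundreds of lines like this
    # write each (possibly reassigned) array back into the dataframe
    for col in df.columns:
        exec( 'df.' + col + '=' + col )

df = df_orig.copy()
method1( df ) # df is the same object as the global df (modified in place), no need to return it
df1 = df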
So there are 2 issues:
1) Using exec like this is generally considered a bad idea, and it has already caused me a problem when I tried to combine it with numba. Or is it actually not that bad? It seems to work fine for series and arrays.
2) I'm not sure of the best way to take advantage of views here. Ideally, all I really want is for x to be a view of df.x. I assume that is not possible when x is an array, but maybe it is when x is a series?
The example above is for arrays, but ideally I'm looking for a solution that also applies to series. Failing that, solutions that work with one or the other are of course welcome.
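For comparison, here is a minimal exec-free sketch of the same split/process/rejoin pattern. It assumes the processing can live in a function whose parameter names match the column names (so the column names must be valid Python identifiers). Both `process` and `apply_by_column_name` are hypothetical names made up for this illustration, not part of pandas.

```python
import numpy as np
import pandas as pd


def process(x, y, z):
    # Parameter names mirror the column names, so the body reads like the original
    z = x + y  # in the real use case there would be hundreds of lines like this
    return {'x': x, 'y': y, 'z': z}


def apply_by_column_name(df, func, as_arrays=True):
    # Hypothetical helper: pass each column as a keyword argument named after it
    columns = {
        col: (df[col].to_numpy() if as_arrays else df[col])
        for col in df.columns
    }
    result = func(**columns)
    # Rejoin the returned columns into a new dataframe on the original index
    return pd.DataFrame(result, index=df.index)


df_orig = pd.DataFrame({
    'x': np.arange(1000),
    'y': np.arange(1000, 2000),
    'z': np.zeros(1000),
})
df1 = apply_by_column_name(df_orig, process)                   # arrays
df2 = apply_by_column_name(df_orig, process, as_arrays=False)  # series
```

This builds a new dataframe rather than writing back into the original, so it does not address the question about views; it only removes the need for exec.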
Motivation:
1) Readability. This can partially be achieved with eval, but I don't believe eval can be used over multiple lines.
2) With multiple lines like z=x+y, this method is a little faster with series (2x or 3x in examples I've tried) and even faster with arrays (over 10x). See here: Fastest way to numerically process 2d-array: dataframe vs series vs array vs numba
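On the eval point in 1): newer pandas versions appear to accept a multi-line string in `DataFrame.eval`, provided every line is an assignment. The sketch below assumes such a version (older releases do not support this), and the extra column `w` is made up purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': range(5), 'y': range(5, 10)})

# Each line is an assignment; later lines can use columns created by earlier ones
expr = "z = x + y\nw = 2 * z"
result = df.eval(expr)  # returns a new dataframe; df itself is left unchanged
```

Whether this keeps the speed advantage described in 2) is not something this sketch measures.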