I have a dataframe that contains X & Y data in columns like this:

import numpy as np
import pandas as pd

df_cols = ['x1', 'y1', 'x2', 'y2', 'x3', 'y3']

np.random.seed(365)
df = pd.DataFrame(np.random.randint(0, 10, size=(10, 6)), columns=df_cols)

   x1  y1  x2  y2  x3  y3
0   2   4   1   5   2   2
1   9   8   4   0   3   3
2   7   7   7   0   8   4
3   3   2   6   2   6   8
4   9   6   1   6   5   7
5   7   6   5   9   3   8
6   7   9   9   0   1   4
7   0   9   6   5   6   9
8   5   3   2   7   9   2
9   6   6   3   7   7   1

I need to call a function that takes one X & Y pair at a time and returns an updated X & Y pair (same length). I then want to either save that data to a new dataframe with the original column names, or replace the old X & Y data with the new data and keep the original column names.

For example, take this function below:

def samplefunc(x, y):
    x = x*y
    y = x/10
    return x, y

# Apply function to each x & y pair 
x1, y1 = samplefunc(df.x1, df.y1)
x2, y2 = samplefunc(df.x2, df.y2)
x3, y3 = samplefunc(df.x3, df.y3)

# Save new/updated x & y pairs into a new dataframe, preserving the original column names
df_updated = pd.DataFrame({'x1': x1, 'y1': y1, 'x2': x2, 'y2': y2, 'x3': x3, 'y3': y3})

# Desired result:
In [36]: df_updated
Out[36]: 
   x1   y1  x2   y2  x3   y3
0   8  0.8   5  0.5   4  0.4
1  72  7.2   0  0.0   9  0.9
2  49  4.9   0  0.0  32  3.2
3   6  0.6  12  1.2  48  4.8
4  54  5.4   6  0.6  35  3.5
5  42  4.2  45  4.5  24  2.4
6  63  6.3   0  0.0   4  0.4
7   0  0.0  30  3.0  54  5.4
8  15  1.5  14  1.4  18  1.8
9  36  3.6  21  2.1   7  0.7

But doing it this way is tedious and impractical for a large dataset, because every pair has to be written out by hand. The similar/related questions I've found perform a simple transformation on the data rather than calling a function, or they add new columns to the dataframe instead of replacing the originals.

I tried to apply @PaulH's answer to my dataset, but neither method works for me, because it is unclear how to actually call the function inside either one.

# Method 1
array = np.array(my_actual_df)
df_cols = my_actual_df.columns
dist = 0.04 # a parameter I need for my function 
df = (
    pandas.DataFrame(array, columns=df_cols)
        .rename_axis(index='idx', columns='label')
        .stack()
        .to_frame('value')
        .reset_index()
        .assign(value=lambda df: numpy.select(
            [df['label'].str.startswith('x'), df['label'].str.startswith('y')],

            # Call the function (not working): 
            [df['value'], df['value']] = samplefunc(df['value'], df['value']),
        ))
        .pivot(index='idx', columns='label', values='value')
        .loc[:, df_cols]
)



# Method 2
df = (
    pandas.DataFrame(array, columns=df_cols)
        .pipe(lambda df: df.set_axis(df.columns.map(lambda c: (c[0], c[1])), axis='columns'))
        .rename_axis(columns=['which', 'group'])
        .stack(level='group')
         
        # Call the function (not working)
        .assign(df['x'], df['y'] = samplefunc(df['x'], df['y']))
        .unstack(level='group')
        .pipe(lambda df: df.set_axis([''.join(c) for c in df.columns], axis='columns'))
)

The actual function I need to call is from Arty's answer to this question: Resample trajectory to have equal euclidean distance in each sample

CentauriAurelius
  • Does this answer your question? [How to apply a function to two columns of Pandas dataframe](https://stackoverflow.com/questions/13331698/how-to-apply-a-function-to-two-columns-of-pandas-dataframe) – VirtualScooter Feb 17 '21 at 03:10
  • If the function performs separate actions on the x and y columns, you could add a condition that checks the column name and selects a different function for x and y columns. This makes the whole process much easier – Amir Maleki Feb 17 '21 at 03:12
  • @VirtualScooter Thanks but no it doesnt answer my question because it creates a new column in the original dataframe rather than replacing the original data with the output. It also doesnt preserve the column names when it adds the new data. – CentauriAurelius Feb 17 '21 at 03:12
  • @AmirMaleki The actual function I'm using requires both x & y values at the same time as input, and returns both updated x & y – CentauriAurelius Feb 17 '21 at 03:13
  • kindly add a seed to your randomisation, so the data stays the same – sammywemmy Feb 17 '21 at 03:25
  • @sammywemmy done! – CentauriAurelius Feb 17 '21 at 03:28
  • even number of columns always? – Ch3steR Feb 17 '21 at 04:03
  • @Ch3steR yes always even number of columns – CentauriAurelius Feb 17 '21 at 04:04
  • @CentauriAurelius you should read the documentation on `DataFrame.assign` – Paul H Feb 17 '21 at 04:33
  • @Paul H the documentation shows how to use lambda functions with df.assign but I need to call a much more complicated function that is defined in a separate module, and that takes both X & Y at once and returns both updated X & Y. – CentauriAurelius Feb 17 '21 at 04:37
  • right, that's why I need to see that function you're calling – Paul H Feb 17 '21 at 04:38
  • @PaulH The actual function I'm calling is from the answer to this question (by Arty): https://stackoverflow.com/questions/64441803/resample-trajectory-to-have-equal-euclidean-distance-in-each-sample. – CentauriAurelius Feb 17 '21 at 04:40
  • I would modify that function to take the dataframe directly and pull out the columns as needed. Then you can `pipe` the stacked dataframe in the second method directly to it. You might need a `groupby` in there as well. – Paul H Feb 17 '21 at 04:43

3 Answers

1

Use slicing and apply operations on those slices.

def samplefunc(x, y):
    x = x**2
    y = y/10
    return x, y

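# object dtype lets integer x values and float y values share one array
arr = df.to_numpy().astype(object)
e_col = arr[:, ::2]   # even-positioned columns: x1, x2, x3
o_col = arr[:, 1::2]  # odd-positioned columns: y1, y2, y3
e_col, o_col = samplefunc(e_col, o_col)
arr[:, ::2] = e_col
arr[:, 1::2] = o_col
out = pd.DataFrame(arr, columns=df.columns)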

   x1   y1  x2   y2  x3   y3
0   4  0.4   1  0.5   4  0.2
1  81  0.8  16  0.0   9  0.3
2  49  0.7  49  0.0  64  0.4
3   9  0.2  36  0.2  36  0.8
4  81  0.6   1  0.6  25  0.7
5  49  0.6  25  0.9   9  0.8
6  49  0.9  81  0.0   1  0.4
7   0  0.9  36  0.5  36  0.9
8  25  0.3   4  0.7  81  0.2
9  36  0.6   9  0.7  49  0.1
Ch3steR
  • can you edit the answer to call the sample function I provided? (samplefunc) – CentauriAurelius Feb 17 '21 at 04:29
  • @CentauriAurelius No need to reshape actually, edited the answer. ;) – Ch3steR Feb 17 '21 at 04:59
  • Thanks, this is a lot more understandable for me, but the only issue is that it passes all the even columns and all the odd columns in one go. The function I'm working with requires that each X & Y pair are passed one at a time – CentauriAurelius Feb 17 '21 at 05:55
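
For the case raised in the last comment, where the function must receive one X & Y pair per call, a plain loop over adjacent column pairs is one option. The following is a minimal sketch rather than part of the original answer; it uses the `samplefunc` from the question and assumes the columns strictly alternate x, y, x, y.

```python
import numpy as np
import pandas as pd


def samplefunc(x, y):
    # Sample function from the question: one pair in, one pair out.
    x = x * y
    y = x / 10
    return x, y


np.random.seed(365)
df_cols = ['x1', 'y1', 'x2', 'y2', 'x3', 'y3']
df = pd.DataFrame(np.random.randint(0, 10, size=(10, 6)), columns=df_cols)

new_cols = {}
# Walk the columns two at a time: (x1, y1), (x2, y2), ...
# Assumption: x columns sit at even positions, y columns at odd positions.
for xcol, ycol in zip(df.columns[::2], df.columns[1::2]):
    new_x, new_y = samplefunc(df[xcol].to_numpy(), df[ycol].to_numpy())
    new_cols[xcol] = new_x
    new_cols[ycol] = new_y

# Dicts keep insertion order, so the original column order is preserved.
df_updated = pd.DataFrame(new_cols, index=df.index)
```

Because each pair is handled in its own call, the x columns can stay integer while the y columns become float, so no object-dtype array is needed.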
0

There are a couple of ways you could do this, depending on how your real-life dataframe is constructed.

The first thing that comes to mind is to fully stack the dataframe and then use numpy.select to compute the new values based on the labels. You can then pivot the dataframe back to its original form:

import numpy
import pandas

df_cols = ['x1', 'y1', 'x2', 'y2', 'x3', 'y3']


numpy.random.seed(365)
array = numpy.random.randint(0, 10, size=(10, 6))
df = (
    pandas.DataFrame(array, columns=df_cols)
        .rename_axis(index='idx', columns='label')
        .stack()
        .to_frame('value')
        .reset_index()
        .assign(value=lambda df: numpy.select(
            [df['label'].str.startswith('x'), df['label'].str.startswith('y')],
            [df['value'] ** 2, df['value'] / 10],
        ))
        .pivot(index='idx', columns='label', values='value')
        .loc[:, df_cols]
)
label    x1   y1    x2   y2    x3   y3
idx                                   
0       4.0  0.4   1.0  0.5   4.0  0.2
1      81.0  0.8  16.0  0.0   9.0  0.3
2      49.0  0.7  49.0  0.0  64.0  0.4
3       9.0  0.2  36.0  0.2  36.0  0.8
4      81.0  0.6   1.0  0.6  25.0  0.7
5      49.0  0.6  25.0  0.9   9.0  0.8
6      49.0  0.9  81.0  0.0   1.0  0.4
7       0.0  0.9  36.0  0.5  36.0  0.9
8      25.0  0.3   4.0  0.7  81.0  0.2
9      36.0  0.6   9.0  0.7  49.0  0.1

Alternatively, you could treat the column names as a hierarchy, turn them into a multi-level index, and then stack only the second level of that index. That way, you end up with separate x and y columns that you can operate on directly and explicitly:

df = (
    pandas.DataFrame(array, columns=df_cols)
        .pipe(lambda df: df.set_axis(df.columns.map(lambda c: (c[0], c[1:])), axis='columns'))
        .rename_axis(columns=['which', 'group'])
        .stack(level='group')
        .assign(x=lambda df: df['x'] ** 2, y=lambda df: df['y'] / 10)
        .unstack(level='group')
        .pipe(lambda df: df.set_axis([''.join(c) for c in df.columns], axis='columns'))
)
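
If the function has to be called once per pair with both columns at once, `assign` alone is awkward because it sets one column per keyword. One possible variation, sketched here as an assumption rather than as the answer's own code, groups the stacked frame by the `group` level and calls the two-argument `samplefunc` from the question inside a small helper (`call_on_pair` is a hypothetical name). The suffix is taken as `c[1:]` so that names such as `x10` would also split cleanly.

```python
import numpy as np
import pandas as pd


def samplefunc(x, y):
    # Sample function from the question: one pair in, one pair out.
    x = x * y
    y = x / 10
    return x, y


def call_on_pair(group):
    # `group` holds the rows of a single pair, with plain 'x' and 'y' columns.
    new_x, new_y = samplefunc(group['x'], group['y'])
    return group.assign(x=new_x, y=new_y)


np.random.seed(365)
df_cols = ['x1', 'y1', 'x2', 'y2', 'x3', 'y3']
array = np.random.randint(0, 10, size=(10, 6))

result = (
    pd.DataFrame(array, columns=df_cols)
      # 'x1' -> ('x', '1'): first character is the axis, the rest is the pair id
      .pipe(lambda df: df.set_axis(df.columns.map(lambda c: (c[0], c[1:])), axis='columns'))
      .rename_axis(columns=['which', 'group'])
      .stack(level='group')
      # one call of samplefunc per pair
      .groupby(level='group', group_keys=False)
      .apply(call_on_pair)
      .unstack(level='group')
      .pipe(lambda df: df.set_axis([''.join(c) for c in df.columns], axis='columns'))
      .loc[:, df_cols]  # restore the original column order
)
```

Recent pandas versions may emit a `FutureWarning` from `stack`; the sketch does not depend on which stacking implementation is used, since `unstack` restores the row order.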

Paul H
  • Thank you, Im just unsure how to include actually calling the function inside of the code you provided. I added my attempt into my question. The function requires both x & y to be passed together, and then returns the updated x & y arrays together. – CentauriAurelius Feb 17 '21 at 04:03
  • @CentauriAurelius I think the second method is what you want then – Paul H Feb 17 '21 at 04:15
  • Its also unclear how to call my function inside the second method. I added my attempt for using the second method to my question as well. – CentauriAurelius Feb 17 '21 at 04:22
  • @CentauriAurelius where does `resample_euclid_equidist` come from? – Paul H Feb 17 '21 at 04:30
  • it is the actual function I need to apply to my df. I just edited the code to call 'samplefunc' instead. – CentauriAurelius Feb 17 '21 at 04:33
0

New approach here:

  • split the column names into a multi-level index
  • do a horizontal groupby (group along the columns, one group per pair)
  • modify your samplefunc to take a dataframe:
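def samplefunc(df, xcol='x', ycol='y'):
    # unpack the two columns into numpy arrays
    x = df[xcol].to_numpy()
    y = df[ycol].to_numpy()

    # compute the new x first so that y is derived from the updated x,
    # matching the original samplefunc (x = x*y; y = x/10)
    new_x = x * y
    df[xcol] = new_x
    df[ycol] = new_x / 10
    return df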

df = (
    pandas.DataFrame(array, columns=df_cols)
        .pipe(lambda df: df.set_axis(df.columns.map(lambda c: (c[0], c[1:])), axis='columns'))
        .rename_axis(columns=['which', 'group'])
        .groupby(level='group', axis='columns')
        .apply(samplefunc)
        .pipe(lambda df: df.set_axis([''.join(c) for c in df.columns], axis='columns'))
)

And I get:

   x1   y1  x2   y2  x3   y3
0   8  0.8   5  0.5   4  0.4
1  72  7.2   0  0.0   9  0.9
2  49  4.9   0  0.0  32  3.2
3   6  0.6  12  1.2  48  4.8
4  54  5.4   6  0.6  35  3.5
5  42  4.2  45  4.5  24  2.4
6  63  6.3   0  0.0   4  0.4
7   0  0.0  30  3.0  54  5.4
8  15  1.5  14  1.4  18  1.8
9  36  3.6  21  2.1   7  0.7
Paul H
  • This is great for the simple sample function, but I was really hoping for an answer that didnt require changing the function (i.e., simply calling the function on every pair of 2 columns and getting two columns back). I'm worried if I tried to change the resample_euclid_equidist function I would break it or it would take forever to debug since its so enormous and complex (and I'm a mediocre programmer at best). – CentauriAurelius Feb 18 '21 at 00:25
  • @CentauriAurelius The way I changed the function is simply by unpacking/packing the dataframe columns into numpy arrays. Two lines are at the beginning. Two lines at the end. – Paul H Feb 18 '21 at 00:49
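
The unpack/pack idea from the last comment can also live outside the function, so the function body stays untouched. Below is a minimal sketch under assumptions: `apply_pairwise` is a hypothetical helper, columns are assumed to be named with an `x`/`y` prefix plus a shared suffix, and `dist` is a made-up extra parameter standing in for whatever the real function needs.

```python
import functools

import numpy as np
import pandas as pd


def samplefunc(x, y, dist=1.0):
    # Stand-in for the real function: two arrays in, two arrays out.
    # `dist` is a hypothetical extra parameter, used here only as a divisor.
    x = x * y
    y = x / dist
    return x, y


def apply_pairwise(df, func, xprefix='x', yprefix='y'):
    """Call func(x, y) once per x/y pair, matched by suffix; func is not modified."""
    out = df.copy()
    for xcol in df.columns:
        if not xcol.startswith(xprefix):
            continue
        ycol = yprefix + xcol[len(xprefix):]  # 'x12' -> 'y12'
        # unpack: two columns -> two numpy arrays
        new_x, new_y = func(df[xcol].to_numpy(), df[ycol].to_numpy())
        # pack: write the results back under the original names
        out[xcol] = new_x
        out[ycol] = new_y
    return out


np.random.seed(365)
df_cols = ['x1', 'y1', 'x2', 'y2', 'x3', 'y3']
df = pd.DataFrame(np.random.randint(0, 10, size=(10, 6)), columns=df_cols)

# functools.partial fixes the extra argument so func still takes just (x, y)
df_updated = apply_pairwise(df, functools.partial(samplefunc, dist=10))
```

The helper returns a new dataframe and leaves `df` as it was; assigning the result back to `df` would give the replace-in-place behaviour described in the question.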