0

I'm trying to find a faster way to apply a function several times to a set of data housed in DataFrames.

I have two DataFrames:

  1. Parameters: has a column for each argument of the function, each row is a specific parameter set. There is also a column giving a unique name to each set.
  2. Original Data: houses original data in a column

For each set of parameters, I want to add a column to the original DataFrame with the result from "func" and set the column name to the parameter set name.

Currently I'm looping through the rows of the parameter DataFrame, but I feel like there's a better way to do it.

I'm trying to see if there's a vectorized solution, but so far I've been unsuccessful working with two DataFrames.

I've tried following cs95's answer in this post, but almost all of the examples for vectorization or list comprehensions are only dealing with a single DataFrame: How to iterate over rows in a DataFrame in Pandas

Is there a better way to do this?

I feel like there maybe something obvious I'm missing.

import pandas as pd

def func(data, a, b, c):
    return data["original"] + a + b * c


parameters = pd.DataFrame(
    {
        "name": ["set_1", "set_2", "set_3"],
        "a": [1, 2, 3],
        "b": [4, 5, 6],
        "c": [7, 8, 9],
    }
)

data = pd.DataFrame({"original": [10, 11, 12, 13, 14, 15]})

for i, row in parameters.iterrows():
    data[row["name"]] = func(data, row["a"], row["b"], row["c"])

    Inputs:
    
Parameters DataFrame:
    name  a  b  c
0  set_1  1  4  7
1  set_2  2  5  8
2  set_3  3  6  9
    
Original Data DataFrame:
       original
    0        10
    1        11
    2        12
    3        13
    4        14
    5        15
    

    Output:
    
       original  set_1  set_2  set_3
    0        10     39     52     67
    2        12     41     54     69
    3        13     42     55     70
    4        14     43     56     71
    5        15     44     57     72
recneps15
  • 1
  • 1

3 Answers3

0

No direct use of looping (implied in apply())

  • merge data together using a Cartesian product
  • do the calc with apply()
  • reshape using pivot() to your desired output structure
df = pd.read_csv(io.StringIO("""       original
    0        10
    1        11
    2        12
    3        13
    4        14
    5        15"""), sep="\s+")

dfp = pd.read_csv(io.StringIO("""    name  a  b  c
0  set_1  1  4  7
1  set_2  2  5  8
2  set_3  3  6  9"""), sep="\s+")

# catesian product data with params
dfm = df.assign(foo=1).merge(dfp.assign(foo=1), on="foo")
# do the calc
dfm = dfm.assign(calc=dfm.apply(lambda x: x.original + x.a + x.b * x.c, axis=1))
# reshape
dfm = dfm.pivot(index="original", columns="name", values="calc").reset_index()
original set_1 set_2 set_3
0 10 39 52 67
1 11 40 53 68
2 12 41 54 69
3 13 42 55 70
4 14 43 56 71
5 15 44 57 72
Rob Raymond
  • 29,118
  • 3
  • 14
  • 30
0

You could modify the code below to suit your expectation, I left the row 1 (original = 11) in there; should be easy to modify.

The code below avoids iterrows, as it can be quite slow. What it does is get the values from parameters first, then do the multiplication with data['original'] for each column to get the final output:

def func(left_df, right_df):
    right_df = right_df.copy()
    new_headers = right_df["name"].array
    right_df = (right_df["a"] + right_df["b"] * right_df["c"]).array
    right_df = dict(zip(new_headers, new))
    return left_df.assign(
        **{key: left_df["original"] + value for key, value in right_df.items()}
    )


data.pipe(func, parameters)


    original    set_1   set_2   set_3
0        10       39      52    67
1        11       40      53    68
2        12       41      54    69
3        13       42      55    70
4        14       43      56    71
5        15       44      57    72
sammywemmy
  • 27,093
  • 4
  • 17
  • 31
0

Define your function as follows:

def func2(dataCol, a, b, c):
    return dataCol[:, np.newaxis] + a[np.newaxis, :] + b[np.newaxis, :] * c[np.newaxis, :]

Differences:

  • the first parameter is a source data column, not the whole DataFrame,
  • 3 remaining parameters are also columns (taken from parameters), instead of individual values.

Then, to get your result, call it the following way:

data[parameters.name.tolist()] = func2(data.original, parameters.a,
    parameters.b, parameters.c)

Using %timeit I checked the execution times of your and my code. My code executes in about 60 % of the time measured for your code.

Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41