I need to apply a function to a DataFrame row by row, using two particular cells of the current row as inputs. The function is the following:

def function(x, y):
    z = 2*x*y
    values.append(z)
    return z

The catch is that I don't need the result of applying the function: I only need the input values to perform some operations and fill the global list called values. Suppose the pd.DataFrame is the following:

| col1 | col2 | col3 |
| 2    | 3    | 5    |
| 10   | 12   | 14   |
| ...  | ...  | ...  |

I would usually apply the function like this:

df.apply(lambda x: function(x['col2'], x['col3']), axis=1)

The problem with apply is that the last line of code creates a pd.Series. As a result, I hold in memory not only the global list values, which I need for other purposes, but also a Series that I don't need at all. (The list is just an example; it stands in for any other data structure that the function could build.)

How can I apply the function without occupying additional memory?

Nicola Fanelli

2 Answers

This operation can be vectorized directly across rows, so you can avoid .apply() entirely, which will be tremendously faster.
See the canonical answer for "How to iterate over rows in a DataFrame in Pandas?"

You can't avoid using memory for the results, because they need to go somewhere, but you can drop columns you no longer need before or after performing the calculation.

Just keeping the results in a DataFrame column (a Series) rather than a list of native ints is already a memory saving. You may also find that explicitly setting or reducing the dtypes of your DataFrame is a big saving if they're not already in their most efficient types, for example going from int64 to uint16 or even uint8 (which can still hold the example values).

>>> import pandas as pd
>>> df = pd.DataFrame({"col1": [2,10], "col2": [3,12], "col3": [5,4]})
>>> df
   col1  col2  col3
0     2     3     5
1    10    12     4
>>> df["2xy"] = 2 * df["col2"] * df["col3"]
>>> df
   col1  col2  col3  2xy
0     2     3     5   30
1    10    12     4   96
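The dtype reduction mentioned above could look roughly like the following. This is only a sketch on the same two-row example; the names small and result are made up for illustration. Note that small unsigned types wrap around silently on overflow, so here one operand is widened to uint16 before multiplying, which is an assumption about what range the products need.

```python
import pandas as pd

df = pd.DataFrame({"col1": [2, 10], "col2": [3, 12], "col3": [5, 4]})

# Bytes used by the column data with the default int64 dtype.
bytes_before = int(df.memory_usage(index=False).sum())

# Downcast every column to the smallest unsigned integer type that fits its values.
small = df.apply(pd.to_numeric, downcast="unsigned")
bytes_after = int(small.memory_usage(index=False).sum())

# Widen one operand first so the product has headroom beyond uint8's maximum of 255.
result = 2 * small["col2"].astype("uint16") * small["col3"]

print(small.dtypes)
print(bytes_before, bytes_after)
print(result.tolist())
```

Whether uint16 is wide enough depends on the real data, so it is worth checking the maximum possible product before choosing a type.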
ti7
  • Thank you! But what if the function is more complex, for example if I have to create a global graph or tree? Is iterating my best shot in that case? – Nicola Fanelli Jan 20 '22 at 15:32
  • To be honest, I'd follow the advice here (https://stackoverflow.com/a/55557758/4541045): try explicitly to avoid iteration and use an existing method to get what you want, even if you have to transform the input data a little first. Because the transformation is also vectorized, many extremely fast operations are overall much faster than a single very slow one. It's often worth running both approaches on a sample to show that the results match and the speedup exists, then recording that you did this research and using the faster method. – ti7 Jan 20 '22 at 15:35
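The "try both on a sample" check suggested in the comment above might be sketched like this. The random data, the sample size, and the helper names with_apply and vectorized are all assumptions for illustration; timings will vary by machine, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data standing in for the real DataFrame.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 100, size=(10_000, 3)),
                  columns=["col1", "col2", "col3"])
sample = df.sample(n=1_000, random_state=0)

def with_apply(frame):
    # Row-by-row version, as in the question.
    return frame.apply(lambda row: 2 * row["col2"] * row["col3"], axis=1)

def vectorized(frame):
    # Column-wise version, as in the answer.
    return 2 * frame["col2"] * frame["col3"]

# First confirm both approaches agree on the sample.
same = bool((with_apply(sample) == vectorized(sample)).all())

# Then compare how long each takes on the same sample.
t_apply = timeit.timeit(lambda: with_apply(sample), number=3)
t_vec = timeit.timeit(lambda: vectorized(sample), number=3)

print("results match:", same)
print(f"apply: {t_apply:.4f}s  vectorized: {t_vec:.4f}s")
```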

This seems too simple so I may be missing something, but couldn't you do this in... a loop?

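values = []  # the global list from the question

def function(x, y):
    z = 2*x*y
    return z

# iterrows() yields (index, row) pairs; each row is a Series
for i, row in df.iterrows():
    values.append(function(row['col2'], row['col3']))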

This would solve the literal problem you raised: no second object besides values is created in memory to store the results.
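If the per-row Series that iterrows() builds is itself a concern, one possible variant is to iterate over only the two needed columns with zip. This is a minimal sketch, assuming the example data from the other answer and the global list values from the question:

```python
import pandas as pd

df = pd.DataFrame({"col1": [2, 10], "col2": [3, 12], "col3": [5, 4]})
values = []  # the global list to fill

def function(x, y):
    return 2 * x * y

# zip walks the two columns in lockstep, yielding plain scalars
# instead of constructing a Series for every row.
for x, y in zip(df["col2"], df["col3"]):
    values.append(function(x, y))

print(values)  # [30, 96]
```

It is still a Python-level loop, so for simple arithmetic like this the vectorized expression is likely to be faster; the loop form mainly makes sense when the function has side effects that can't be vectorized.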

bsauce
  • Yes, this was what I would have done as an alternative. But I was searching for a more "pythonic" way, without iterating on the df. Thank you. – Nicola Fanelli Jan 20 '22 at 15:24