
What I have:

  • A dataframe with many rows, and several existing columns (python, pandas).
  • Python 3.6, so a solution that relies on that particular version is fine with me (but obviously solutions that also work for earlier versions are fine too)

What I want to do:

  • Add multiple additional columns to the dataframe, where the values in these new columns all depend in some way on values in existing columns in the same row.
  • The original order of the dataframe must be preserved. If a solution changes the ordering, I could restore it afterwards by manually sorting based on one of the existing columns, but obviously this introduces extra overhead.

I already have the following code, which does work correctly. However, profiling has indicated that this code is one of the important bottlenecks in my code, so I'd like to optimize it if possible, and I also have reason to believe that should be possible:

df["NewColumn1"] = df.apply(lambda row: compute_new_column1_value(row), axis=1)
df["NewColumn2"] = df.apply(lambda row: compute_new_column2_value(row), axis=1)

# a few more lines of code like the above

I based this solution on answers to questions like this one (which is a question similar to mine, but specifically about adding one new column, whereas my question is about adding many new columns). I suppose that each of these df.apply() calls is internally implemented with a loop through all the rows, and I suspect it should be possible to optimize this with a solution that only loops through all the rows once (as opposed to once per column I want to add).

In other answers, I have seen references to the assign() function, which does indeed support adding multiple columns at once. I tried using this in the following way:

# WARNING: this does NOT work
df = df.assign(
    NewColumn1=lambda row: compute_new_column1_value(row),
    NewColumn2=lambda row: compute_new_column2_value(row),
    # more lines like the two above
)

The reason this doesn't work is that the lambdas don't receive rows of the dataframe as arguments at all; they simply get the entire dataframe at once, and each lambda is then expected to return a complete column/Series/array of values at once. So my problem here is that I'd end up having to implement manual loops through all the rows myself inside those lambdas, which is obviously going to be even worse for performance.
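For illustration, a minimal sketch of the behaviour that assign() does expect (the column "A" and the computations here are made up): each callable receives the whole dataframe and must return a full column.

import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})

# Each callable receives the entire dataframe and must return a complete
# column (a Series or array aligned with df's index), not a per-row scalar.
df = df.assign(NewColumn1=lambda d: d["A"] * 2,
               NewColumn2=lambda d: d["A"] + 10)
print(df)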

I can think of two solutions conceptually, but have been unable to find how to actually implement them so far:

  1. Something like df.assign() (which supports adding multiple columns at once), but with the ability to pass rows into the lambdas instead of the complete dataframe

  2. A way to vectorize my compute_new_columnX_value() functions, so that they can be used as lambdas in the way that df.assign() expects them to be used.

My problem with the second solution so far is that the row-based versions of some of my functions look as follows, and I have difficulty finding out how to properly vectorize them:

def compute_new_column1_value(row):
    if row["SomeExistingColumn"] in some_dictionary:
        return some_dictionary[row["SomeExistingColumn"]]
    else:
        return some_default_value
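For this specific dictionary-lookup pattern, a column-wise formulation might be possible with Series.map plus fillna (a sketch is below, assuming some_dictionary and some_default_value are defined as above and that the default is not None), but I don't see how to do something similar for my other, more involved functions:

# Possible vectorized equivalent of compute_new_column1_value:
# map() performs the dictionary lookup for the whole column at once,
# fillna() then substitutes the default for keys that were not found.
df["NewColumn1"] = (df["SomeExistingColumn"]
                    .map(some_dictionary)
                    .fillna(some_default_value))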
Dennis Soemers
  • Have you checked [this](https://stackoverflow.com/questions/16575868/efficiently-creating-additional-columns-in-a-pandas-dataframe-using-map) question? – Sohaib Farooqi Jan 19 '18 at 16:00
  • @GarbageCollector I hadn't seen that question yet. If I understand it correctly though, that `applymap()` function applies the same function to every cell in the dataframe to compute its new values. In my case, I want to apply multiple different functions (`compute_new_column1_value` can be quite different from `compute_new_column2_value`), so I think I'd end up with a similar solution as the one I already have, with multiple loops through all the rows? – Dennis Soemers Jan 19 '18 at 16:07
  • Is there a sample dataset that we can test on? We will also need the expected output. – Tai Jan 20 '18 at 02:16

5 Answers

4

Have you tried initializing the columns as nan, iterating through the dataframe by row, and assigning the values with loc?

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 20, (10, 5)))

df[5] = np.nan
df[6] = np.nan

for i, row in df.iterrows():
    df.loc[i, 5] = row[1] + row[4]
    df.loc[i, 6] = row[3] * 2

print(df)

yields (first the original random frame, then the frame with the two new columns filled in)

    0   1   2   3   4
0  17   4   3  11  10
1  16   1  14  11  16
2   4  18  12  19   7
3  11   3   7  10   5
4  11   0  10   1  17
5   5  17  10   3   8
6   0   0   7   3   6
7   7  18  18  13   8
8  16   4  12  11  16
9  13   9  15   8  19

    0   1   2   3   4     5     6
0  17   4   3  11  10  14.0  22.0
1  16   1  14  11  16  17.0  22.0
2   4  18  12  19   7  25.0  38.0
3  11   3   7  10   5   8.0  20.0
4  11   0  10   1  17  17.0   2.0
5   5  17  10   3   8  25.0   6.0
6   0   0   7   3   6   6.0   6.0
7   7  18  18  13   8  26.0  26.0
8  16   4  12  11  16  20.0  22.0
9  13   9  15   8  19  28.0  16.0
Eric Ed Lohmar
  • I was under the impression that such a python-based loop through the dataframe was going to be significantly less efficient than using methods (like assign/apply) which have the loop implemented in C++ (under the assumption of a fairly standard installation, I suppose some unconventional ones may have different implementations than C++). But I suppose I can give it a try and see what the performance is like – Dennis Soemers Jan 20 '18 at 09:30
  • Just tried this (with a small change; I used `itertuples` instead of `iterrows`, since it's supposed to be faster according to, for example, this answer: https://stackoverflow.com/a/47604317/6735980). Unfortunately it's indeed about 8% slower than my existing solution. Turns out that a single python-based loop is a bit slower than multiple (9, to be specific) C++-based loops through the `apply` function – Dennis Soemers Jan 20 '18 at 10:55
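(For reference, a sketch of what the itertuples variant mentioned in the comment above might look like, using the same example frame; itertuples yields read-only namedtuples, so the results are collected in lists and assigned as whole columns afterwards.)

# Possible itertuples variant of the iterrows loop above; positional index 0
# of each tuple is column 0, so row[1], row[3] and row[4] are the original
# integer-named columns used in the answer.
col5, col6 = [], []
for row in df.itertuples(index=False):
    col5.append(row[1] + row[4])
    col6.append(row[3] * 2)
df[5] = col5
df[6] = col6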
2

If you only have 50 conditions to check for, it is probably better to iterate through the conditions and fill the cells in blocks rather than going through the whole frame row by row. By the way, .assign() doesn't just accept lambda functions, and the code can also be made a lot more readable than in my previous suggestion. Below is a modified version that also fills the extra columns in place. If this data frame had 10,000,000 rows and I only wanted to apply different operations to 10 groups of number ranges in column A, this would be a very neat way of filling the extra columns.

import pandas as pd
import numpy as np

# Create data frame
rnd = np.random.randint(1, 10, 10)
rnd2 = np.random.randint(100, 1000, 10)
df = pd.DataFrame(
        {'A': rnd, 'B': rnd2, 'C': np.nan, 'D': np.nan, 'E': np.nan })

# Define different ways of filling the extra cells
def f1():
    return df['A'].mul(df['B'])

def f2():
    return np.log10(df['A'])

def f3():
    return df['B'] - df['A']

def f4():
    return df['A'].div(df['B'])

def f5():
    return np.sqrt(df['B'])

def f6():
    return df['A'] + df['B']

# First assign() dependent on a boolean mask
df[df['A'] < 50] = df[df['A'] < 50].assign(C = f1(), D = f2(), E = f3())

# Second assign() dependent on a boolean mask
df[df['A'] >= 50] = df[df['A'] >= 50].assign(C = f4(), D = f5(), E = f6())

print(df)

     A      B       C         D    E
0  4.0  845.0  3380.0  0.602060  841
1  3.0  967.0  2901.0  0.477121  964
2  3.0  468.0  1404.0  0.477121  465
3  2.0  548.0  1096.0  0.301030  546
4  3.0  393.0  1179.0  0.477121  390
5  7.0  741.0  5187.0  0.845098  734
6  1.0  269.0   269.0  0.000000  268
7  4.0  731.0  2924.0  0.602060  727
8  4.0  193.0   772.0  0.602060  189
9  3.0  306.0   918.0  0.477121  303
KRKirov
  • This answer turned out to work best for me in cases where it's possible, e.g. cases where there are only a few different conditions to check for, and where large blocks of rows can easily be computed in a vectorized manner through numpy, as is the case for the example functions in this answer. For other cases (where vectorized computations are not feasible), I'd recommend checking the solution in the answer I posted myself. Ultimately, in my own code, I ended up going for a mix between this answer and my answer, different solutions for different columns – Dennis Soemers Jan 22 '18 at 18:51
1

Rather than trying to bring the row labels into .assign(), you can apply a boolean mask to your data frame before chaining .assign() to it. The example below can easily be extended to multiple boolean conditions and multiple lambdas with or without additional for loops or if statements.

import pandas as pd
import numpy as np

# Create data frame
idx = np.arange(0, 10)
rnd = pd.Series(np.random.randint(10, 20, 10))
alpha_idx = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = pd.DataFrame({'idx': idx, 'A': rnd, 'B': 100})
df.index = alpha_idx

# First assign() dependent on a boolean mask
df_tmp = df[df['A'] < 15].assign(AmulB = lambda x: (x.A.mul(x.B)),
               A_B = lambda x: x.B - x.A)

# Second assign() dependent on a boolean mask
df_tmp2 = df[df['A'] >= 15].assign(AmulB = lambda x: (x.A.div(x.B)),
               A_B = lambda x: x.B + x.A)


# Create a new df with different lambdas combined
df_lambdas = df_tmp.append(df_tmp2)

# Sort values
df_lambdas.sort_values('idx', axis=0, inplace=True)
print(df_lambdas)

    A    B  idx
a  19  100    0
b  17  100    1
c  16  100    2
d  13  100    3
e  15  100    4
f  10  100    5
g  16  100    6
h  15  100    7
i  13  100    8
j  10  100    9 

    A    B  idx  A_B    AmulB
a  19  100    0  119     0.19
b  17  100    1  117     0.17
c  16  100    2  116     0.16
d  13  100    3   87  1300.00
e  15  100    4  115     0.15
f  10  100    5   90  1000.00
g  16  100    6  116     0.16
h  15  100    7  115     0.15
i  13  100    8   87  1300.00
j  10  100    9   90  1000.00
KRKirov
  • Ah, right, I forgot to mention that it's important for the ordering of the rows to be preserved in my case. I don't think that would necessarily happen in this implementation? It does in your example, because the values that you compare to (`A` column) happen to be nicely sorted, but that wouldn't be the case for me. I'll still give it a try, see what the difference in performance is, and worry about the ordering afterwards – Dennis Soemers Jan 20 '18 at 09:33
  • Perhaps you can use a list or a dictionary to store the mini-data frames and then append or concat using list indexing or the dictionary keys to restore the order in your original data frame. – KRKirov Jan 20 '18 at 10:48
  • That sounds... complicated. How would I know when to continue an existing "mini-dataframe" or start a new one? That'd require a manually implemented python loop through all the rows, I think? I suspect it'd be more efficient to just allow the ordering to get messed up and sort it afterwards? – Dennis Soemers Jan 20 '18 at 10:52
  • It just occurred to me that if your index is not a standard range index you can have an additional column idx in your data frame (see the updated answer). Then you can apply different lambdas to blocks of rows that meet a certain condition, store them in mini-frames, append and then sort by the idx column. – KRKirov Jan 20 '18 at 11:07
  • I'll spend some time trying that solution later today. I also found some other potential solutions which look promising (which I was unable to find when I asked the question yesterday) here: https://stackoverflow.com/questions/23586510/return-multiple-columns-from-apply-pandas/35208597 . So I'll try them all and see what's faster in my case, then report back here – Dennis Soemers Jan 20 '18 at 11:49
  • This answer works kind of fine in the case of a single new column (like the new column `AmulB` in your example), but it becomes way too complex (especially from a code readability point of view, but also computationally) when you have to add multiple different new columns like I have to do, and each of those columns can have different conditions they check when assigning values to rows. – Dennis Soemers Jan 20 '18 at 15:40
1

The answers provided so far do not provide a speedup for my specific case, for reasons I provided in the comments. The best solution I've been able to find so far is primarily based on this answer to another question. It didn't provide me a large speedup (about 10%), but it's the best I've been able to do so far. I'd still be very much interested in faster solutions if they exist!

It turns out that, like the assign function, apply can in fact also be given lambdas that return a Series of values for multiple columns at once, instead of only lambdas that return a single scalar. So, the fastest implementation I have so far looks as follows:

# first initialize all the new columns with standard values for entire df at once
# this turns out to be very important. Skipping this comes at a high computational cost
for new_column in ["NewColumn1", "NewColumn2", "etc."]:
    df[new_column] = np.nan

df = df.apply(compute_all_new_columns, axis=1)

And then, instead of having all those separate lambdas for all the different new columns, they're all implemented in the same function like this:

def compute_all_new_columns(row):
    if row["SomeExistingColumn"] in some_dictionary:
        row["NewColumn1"] = some_dictionary[row["SomeExistingColumn"]]
    else:
        row["NewColumn1"] = some_default_value

    if some_other_condition:
        row["NewColumn2"] = whatever
    else:
        row["NewColumn2"] = row["SomeExistingColumn"] * whatever

    # assign values to other new columns here

    # return the modified row so that df.apply() reassembles a full dataframe
    return row

The resulting dataframe contains all the columns it previously did, plus values for all the new columns as inserted on a row-by-row basis by the compute_all_new_columns function. The original ordering is preserved. This solution contains no Python-based loops (which are slow), and only a single loop through the rows "behind the scenes", as provided to us by the pandas apply function.
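A self-contained toy version of the same pattern (the column names, dictionary and per-row logic below are made up purely to show the shape of the solution):

import numpy as np
import pandas as pd

df = pd.DataFrame({"SomeExistingColumn": ["a", "b", "c"]})
some_dictionary = {"a": 1, "b": 2}
some_default_value = 0

# initialize the new columns first, as described above
for new_column in ["NewColumn1", "NewColumn2"]:
    df[new_column] = np.nan

def compute_all_new_columns(row):
    row["NewColumn1"] = some_dictionary.get(row["SomeExistingColumn"],
                                            some_default_value)
    row["NewColumn2"] = row["NewColumn1"] * 10
    return row

df = df.apply(compute_all_new_columns, axis=1)
print(df)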

Dennis Soemers
1

I am really taken by this question so here is another example involving external dictionaries:

import pandas as pd
import numpy as np

# Create data frame and external dictionaries
rnd = pd.Series(np.random.randint(10, 100, 10))

names = 'Rafael Roger Grigor Alexander Dominic Marin David Jack Stan Pablo'
name = names.split(' ')

surnames = 'Nadal Federer Dimitrov Zverev Thiem Cilic Goffin Sock Wawrinka Busta'
surname = surnames.split()

countries_str = ('Spain Switzerland Bulgaria Germany Austria Croatia Belgium USA Switzerland Spain')
country = countries_str.split(' ')

player = dict(zip(name, surname))
player_country = dict(zip(name, country))

df = pd.DataFrame(
        {'A': rnd, 'B': 100, 'Name': name, 'Points': np.nan, 'Surname': np.nan, 'Country': np.nan})

df = df[['A', 'B', 'Name', 'Surname', 'Country', 'Points']]
df.loc[9, 'Name'] = 'Dennis'

print(df)

# Functions to fill the empty columns
def f1():
    return df['A'].mul(df['B'])

def f2():
    return np.random.randint(1, 10)

def f3():
    # note: relies on the loop variable `key` set in the for loop further down
    return player[key]

def f4():
    # same: uses the current `key` from the loop below
    return player_country[key]

def f5():
    return 'Unknown'

def f6():
    return 0

# .assign() dependent on a boolean mask
for key, value in player.items():
    df[df['Name'] == key] = df[df['Name'] == key].assign(
            Surname = f3(), Country = f4(), Points = f1())

df[df['Name']=='Dennis'] = df[df['Name'] == 'Dennis'].assign(
        Surname = f5(), Country = f5(), Points = f6())
df = df.sort_values('Points', ascending=False)
print(df)

     A      B       Name   Surname      Country  Points
1  97.0  100.0      Roger   Federer  Switzerland  9700.0
4  93.0  100.0    Dominic     Thiem      Austria  9300.0
8  92.0  100.0       Stan  Wawrinka  Switzerland  9200.0
5  86.0  100.0      Marin     Cilic      Croatia  8600.0
6  67.0  100.0      David    Goffin      Belgium  6700.0
7  61.0  100.0       Jack      Sock          USA  6100.0
0  35.0  100.0     Rafael     Nadal        Spain  3500.0
2  34.0  100.0     Grigor  Dimitrov     Bulgaria  3400.0
3  25.0  100.0  Alexander    Zverev      Germany  2500.0
9  48.0  100.0     Dennis   Unknown      Unknown     0.0
KRKirov
  • This answer is essentially the same as the other one you posted today, except for 1) added sorting, and 2) more complicated boolean-mask conditions, correct? I'll try timing it, but my main concern is that this conditional indexing (e.g. `df[df['Name'] == key]`) is going to involve additional loops through all the rows "behind the scenes" (can't imagine how that would be implemented without looping through all rows). Will have to see if the introduced ability for vectorized computations of parts of new columns helps to offset these additional loops – Dennis Soemers Jan 21 '18 at 13:57
  • The previous answer I posted today only fills cells by mathematical operations performed on the existing data frame columns. This one loops through the values of an external dictionary in order to determine what to do and shows a more flexible choice of column filling. I might be wrong, but to me it seems faster to loop once to select row labels that meet a condition and then perform the desired operations in blocks. – KRKirov Jan 21 '18 at 14:21
  • Ok, I've tried this. In cases where it's possible to do something like the other answer you posted today, where new column values can easily be computed in a numerical manner, that's about 100 times faster than the solution in my answer. However, using values from existing columns in the same row as keys into dictionaries... that's not something that numpy can vectorize, so the solution doesn't work there. Your solution in this particular answer (looping through all the keys in the dictionary) isn't feasible/useful in my case, since I have a very large dictionary – Dennis Soemers Jan 21 '18 at 16:57
  • I wonder whether there is a way for you to prepare something like a pivot table from your data frame, fill the dictionary keys of interest as an extra column and then fill the new columns based on the now incorporated keys. This might turn out to be the same as what we have already done, I am not sure. But you have the data, so it might be worth considering. – KRKirov Jan 21 '18 at 17:19
  • I guess my use case with dictionaries is always going to be problematic. It's never going to be possible to really vectorize those numpy-style, so I don't think those are gonna get much faster than the solution I have with a single call to `apply`. However, only some of the new columns I'm computing are like that. I also have a few columns which don't have the weird conditions etc., so it looks like the 100-times speedup will be worth it to just separate those columns out and do those vectorized. – Dennis Soemers Jan 21 '18 at 17:44
  • Great! The other thought I had is whether you can turn your column data and dictionary data into sets and then compute the intersection to determine which keys are in both sets and then loop through the filtered set instead of through the whole dictionary. – KRKirov Jan 21 '18 at 17:58
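(To make the mix discussed in the comments above a little more concrete, here is a rough, purely illustrative sketch with made-up column names and conditions: columns that are simple arithmetic on existing columns each get one vectorized assignment, while columns that need per-row logic are still filled by a single apply over the rows, as in the answer further up.)

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.arange(5), "B": np.arange(5) * 10})

# Simple columns: one block-wise (vectorized) assignment each, no Python loop.
df["Sum"] = df["A"] + df["B"]
df["Ratio"] = df["A"] / (df["B"] + 1)

# Initialize the remaining new column first, then fill it with a single
# per-row apply, as in the earlier answer.
df["Flag"] = np.nan

def compute_remaining_columns(row):
    row["Flag"] = "big" if row["Sum"] > 20 else "small"
    return row

df = df.apply(compute_remaining_columns, axis=1)
print(df)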