What I have:
- A dataframe with many rows, and several existing columns (python, pandas).
- Python 3.6, so a solution that relies on that particular version is fine with me (but obviously solutions that also work for earlier versions are fine too)
What I want to do:
- Add multiple additional columns to the dataframe, where the values in these new columns all depend in some way on values in existing columns in the same row.
- The original order of the dataframe must be preserved. If a solution changes the ordering, I could restore it afterwards by manually sorting based on one of the existing columns, but obviously this introduces extra overhead.
I already have the following code, which works correctly. However, profiling has indicated that it is one of the important bottlenecks in my program, so I'd like to optimize it if possible, and I have reason to believe that should be possible:
df["NewColumn1"] = df.apply(lambda row: compute_new_column1_value(row), axis=1)
df["NewColumn2"] = df.apply(lambda row: compute_new_column2_value(row), axis=1)
# a few more lines of code like the above
I based this solution on answers to questions like this one (a question similar to mine, but specifically about adding a single new column, whereas mine is about adding many new columns). I suppose that each of these df.apply() calls is internally implemented as a loop through all the rows, so I suspect it should be possible to optimize this with a solution that loops through all the rows only once (as opposed to once per column I want to add).
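For illustration, the kind of single-pass solution I have in mind might look roughly like the sketch below, where compute_all_new_values is a hypothetical helper combining my per-column functions into one pass. I don't know whether this is actually any faster, though, since it still calls apply() row by row:

import pandas as pd

def compute_all_new_values(row):
    # hypothetical helper: compute every new value for this row in one go
    return pd.Series({
        "NewColumn1": compute_new_column1_value(row),
        "NewColumn2": compute_new_column2_value(row),
    })

# apply() expands the returned Series into one column per key
df[["NewColumn1", "NewColumn2"]] = df.apply(compute_all_new_values, axis=1)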
In other answers, I have seen references to the assign() function, which does indeed support adding multiple columns at once. I tried using this in the following way:
# WARNING: this does NOT work
df = df.assign(
    NewColumn1=lambda row: compute_new_column1_value(row),
    NewColumn2=lambda row: compute_new_column2_value(row),
    # more lines like the two above
)
The reason this doesn't work is that the lambdas don't actually receive rows of the dataframe as arguments at all; each one receives the entire dataframe at once, and is then expected to return a complete column/Series/array of values at once. So my problem here is that I'd end up having to implement manual loops through all the rows myself inside those lambdas, which is obviously going to be even worse for performance.
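To make concrete what assign() actually expects, here is a small example with a made-up, already-vectorized column computation (not one of my real functions):

# assign() hands the whole dataframe to each callable, which must
# return an entire column of values at once
df = df.assign(
    NewColumnA=lambda d: d["SomeExistingColumn"] * 2,  # made-up computation
)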
I can think of two solutions conceptually, but have been unable to find out how to actually implement them so far:

1. Something like df.assign() (which supports adding multiple columns at once), but with the ability to pass rows into the lambdas instead of the complete dataframe.
2. A way to vectorize my compute_new_columnX_value() functions, so that they can be used as lambdas in the way that df.assign() expects them to be used.
My problem with the second solution so far is that the row-based versions of some of my functions look as follows, and I have difficulty finding out how to properly vectorize them:
def compute_new_column1_value(row):
    if row["SomeExistingColumn"] in some_dictionary:
        return some_dictionary[row["SomeExistingColumn"]]
    else:
        return some_default_value
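For this particular function, the closest I have come is a map()-based sketch like the one below, but I'm not sure it behaves identically in all edge cases (e.g. when some_default_value is itself NaN), and I don't see how to generalize the idea to my other, more complicated functions:

# sketch: map() translates values through the dictionary, producing NaN
# for keys that are missing, which fillna() then replaces with the default
df["NewColumn1"] = (
    df["SomeExistingColumn"]
    .map(some_dictionary)
    .fillna(some_default_value)
)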