How to add multiple columns to a DataFrame?

Question

I want to add total fields to this DataFrame:

df_test = pd.DataFrame([
    {'id':1,'cat1a':3,'cat1b':2, 'cat2a':4,'cat2b':3},
    {'id':2,'cat1a':7,'cat1b':5, 'cat2a':9,'cat2b':6}
])

This code almost works:

 def add_total(therecord):
        t1 = therecord['cat1a'] + therecord['cat1b']
        t2 = therecord['cat2a'] + therecord['cat2b']
        return t1, t2

df_test['cat1tot', 'cat2tot'] = df_test[['cat1a', 'cat1b', 'cat2a', 'cat2b']].apply(add_total,axis=1)

Except it results in only 1 new column:

And this code:

 def add_total(therecord):
        t1 = therecord['cat1a'] + therecord['cat1b']
        t2 = therecord['cat2a'] + therecord['cat2b']
        return [t1, t2]

df_test[['cat1tot', 'cat2tot']] = df_test[['cat1a', 'cat1b', 'cat2a', 'cat2b']].apply(add_total,axis=1)

Results in: KeyError: "['cat1tot' 'cat2tot'] not in index"

I tried to resolve that with:

my_cols_list=['cat1tot','cat2tot']
df_test.reindex(columns=[*df_test.columns.tolist(), *my_cols_list], fill_value=0)

But that didn't solve the problem. So what am I missing?

Have you tried `.withColumn()`, possibly with `.drop()` to remove unneeded source columns? Or `df_test.select((df_test.cat1a + df_test.cat1b).alias("cat1tot"))`, etc? — 9000, Feb 27 '18 at 19:28
@9000 Those don't look like valid pandas functions to me... what version are you running? — cs95, Feb 27 '18 at 19:39
@BradRhoads, are you looking to just add totals or more complex calculations which cannot be vectorised? — jpp, Feb 27 '18 at 19:39
@cᴏʟᴅsᴘᴇᴇᴅ: Ah, mistook Pandas `DataFrame` for Spark `DataFrame`! Hence the confusion. Indeed, this won't work with Pandas. — 9000, Feb 28 '18 at 01:10

score 2 · Accepted Answer · answered Feb 27 '18 at 19:35

2

It's generally not a good idea to use df.apply unless you absolutely must. The reason is that these operations are not vectorised, i.e. in the background there is a loop where each row is fed into a function as its own pd.Series.

This would be a vectorised implementation:

df_test['cat1tot'] = df_test['cat1a'] + df_test['cat1b']
df_test['cat2tot'] = df_test['cat2a'] + df_test['cat2b']

#    cat1a  cat1b  cat2a  cat2b  id  cat1tot  cat2tot
# 0      3      2      4      3   1        5        7
# 1      7      5      9      6   2       12       15

answered Feb 27 '18 at 19:35

jpp

159,742
34
281
339

I'm assuming this was a toy example, otherwise this would have been my knee jerk response ;) – cs95 Feb 27 '18 at 19:37
@COLDSPEED, I don't know yet. I'll happily delete this answer if it turns out user wants to do more complex non-parallelisable operations :) – jpp Feb 27 '18 at 19:38
I'm fairly new to python. Can you explain a bit more? It *seems* like your solution would need to loop through the dataset twice instead of just once. Yes, this is a toy example. The real data has 10 sets of 7 columns, so I need to add tot1, tot2, . . ., tot10. – Brad Rhoads Feb 27 '18 at 19:41
Of course, happy to explain. It may seem like you are doing "2 lines of work", but the 2 lines should be significantly faster than the single `df.apply`. In the background, there are highly optimised `numpy`-based calculations [highly efficient library] for simple calculations such as `+`, `*`, etc, which mean the calculation is not cycling through one row at a time. See also: [Pandas - Explanation on apply function being slow](https://stackoverflow.com/questions/38697404/pandas-explanation-on-apply-function-being-slow) – jpp Feb 27 '18 at 19:44
@BradRhoads this approach will be **significantly** faster than any approach involving `.apply`. – juanpa.arrivillaga Feb 27 '18 at 19:45
I feel torn about accepting this answer, but it's actually the most helpful. – Brad Rhoads Feb 27 '18 at 19:51
@BradRhoads why do you feel torn? This is the recommended way to do such a thing with pandas/numpy. – juanpa.arrivillaga Feb 27 '18 at 20:05
I was unsure which answer to accept because @COLDSPEED 's answer was closest to my original approach. – Brad Rhoads Feb 27 '18 at 20:31
@BradRhoads If your actual problem was just adding columns together, then this really is what you should be using. Albeit there are things you could be doing to prevent writing the same thing N times, for each pair of columns. – cs95 Feb 27 '18 at 22:00

score 2 · Answer 2 · answered Feb 27 '18 at 19:35

Return a Series object instead:

def add_total(therecord):
    t1 = therecord['cat1a'] + therecord['cat1b']
    t2 = therecord['cat2a'] + therecord['cat2b']

    return pd.Series([t1, t2])

And then,

df_test[['cat1tot', 'cat2tot']] = \
      df_test[['cat1a', 'cat1b', 'cat2a', 'cat2b']].apply(add_total,axis=1)

df_test

   cat1a  cat1b  cat2a  cat2b  id  cat1tot  cat2tot
0      3      2      4      3   1        5        7
1      7      5      9      6   2       12       15

This works, because apply will special case the Series return type, and assume you want the result as a dataframe slice.

This works and is the closest to what I was trying to do. – Brad Rhoads Feb 27 '18 at 19:50 — Brad Rhoads, Feb 27 '18 at 19:50

score 1 · Answer 3 · answered Feb 27 '18 at 19:29

1

how about

df_test['cat1tot'], df_test['cat2tot'] =\
   df_test[['cat1a', 'cat1b', 'cat2a', 'cat2b']].apply(add_total,axis=1)

answered Feb 27 '18 at 19:29

mortysporty

2,749
6
28
51

How to add multiple columns to a DataFrame?

3 Answers3