1

I want to add total fields to this DataFrame:

df_test = pd.DataFrame([
    {'id':1,'cat1a':3,'cat1b':2, 'cat2a':4,'cat2b':3},
    {'id':2,'cat1a':7,'cat1b':5, 'cat2a':9,'cat2b':6}
])

This code almost works:

 def add_total(therecord):
        t1 = therecord['cat1a'] + therecord['cat1b']
        t2 = therecord['cat2a'] + therecord['cat2b']
        return t1, t2

df_test['cat1tot', 'cat2tot'] = df_test[['cat1a', 'cat1b', 'cat2a', 'cat2b']].apply(add_total,axis=1)

Except it results in only 1 new column:

enter image description here

And this code:

 def add_total(therecord):
        t1 = therecord['cat1a'] + therecord['cat1b']
        t2 = therecord['cat2a'] + therecord['cat2b']
        return [t1, t2]

df_test[['cat1tot', 'cat2tot']] = df_test[['cat1a', 'cat1b', 'cat2a', 'cat2b']].apply(add_total,axis=1)

Results in: KeyError: "['cat1tot' 'cat2tot'] not in index"

I tried to resolve that with:

my_cols_list=['cat1tot','cat2tot']
df_test.reindex(columns=[*df_test.columns.tolist(), *my_cols_list], fill_value=0)

But that didn't solve the problem. So what am I missing?

9000
  • 39,899
  • 9
  • 66
  • 104
Brad Rhoads
  • 1,828
  • 3
  • 29
  • 52
  • Have you tried `.withColumn()`, possibly with `.drop()` to remove unneeded source columns? Or `df_test.select((df_test.cat1a + df_test.cat1b).alias("cat1tot"))`, etc? – 9000 Feb 27 '18 at 19:28
  • @9000 Those don't look like valid pandas functions to me... what version are you running? – cs95 Feb 27 '18 at 19:39
  • @BradRhoads, are you looking to just add totals or more complex calculations which cannot be vectorised? – jpp Feb 27 '18 at 19:39
  • @cᴏʟᴅsᴘᴇᴇᴅ: Ah, mistook Pandas `DataFrame` for Spark `DataFrame`! Hence the confusion. Indeed, this won't work with Pandas. – 9000 Feb 28 '18 at 01:10

3 Answers3

2

It's generally not a good idea to use df.apply unless you absolutely must. The reason is that these operations are not vectorised, i.e. in the background there is a loop where each row is fed into a function as its own pd.Series.

This would be a vectorised implementation:

df_test['cat1tot'] = df_test['cat1a'] + df_test['cat1b']
df_test['cat2tot'] = df_test['cat2a'] + df_test['cat2b']

#    cat1a  cat1b  cat2a  cat2b  id  cat1tot  cat2tot
# 0      3      2      4      3   1        5        7
# 1      7      5      9      6   2       12       15
jpp
  • 159,742
  • 34
  • 281
  • 339
  • I'm assuming this was a toy example, otherwise this would have been my knee jerk response ;) – cs95 Feb 27 '18 at 19:37
  • @COLDSPEED, I don't know yet. I'll happily delete this answer if it turns out user wants to do more complex non-parallelisable operations :) – jpp Feb 27 '18 at 19:38
  • I'm fairly new to python. Can you explain a bit more? It *seems* like your solution would need to loop through the dataset twice instead of just once. Yes, this is a toy example. The real data has 10 sets of 7 columns, so I need to add tot1, tot2, . . ., tot10. – Brad Rhoads Feb 27 '18 at 19:41
  • Of course, happy to explain. It may seem like you are doing "2 lines of work", but the 2 lines should be significantly faster than the single `df.apply`. In the background, there are highly optimised `numpy`-based calculations [highly efficient library] for simple calculations such as `+`, `*`, etc, which mean the calculation is not cycling through one row at a time. See also: [Pandas - Explanation on apply function being slow](https://stackoverflow.com/questions/38697404/pandas-explanation-on-apply-function-being-slow) – jpp Feb 27 '18 at 19:44
  • @BradRhoads this approach will be **significantly** faster than any approach involving `.apply`. – juanpa.arrivillaga Feb 27 '18 at 19:45
  • I feel torn about accepting this answer, but it's actually the most helpful. – Brad Rhoads Feb 27 '18 at 19:51
  • @BradRhoads why do you feel torn? This is the recommended way to do such a thing with pandas/numpy. – juanpa.arrivillaga Feb 27 '18 at 20:05
  • I was unsure which answer to accept because @COLDSPEED 's answer was closest to my original approach. – Brad Rhoads Feb 27 '18 at 20:31
  • @BradRhoads If your actual problem was just adding columns together, then this really is what you should be using. Albeit there are things you could be doing to prevent writing the same thing N times, for each pair of columns. – cs95 Feb 27 '18 at 22:00
2

Return a Series object instead:

def add_total(therecord):
    t1 = therecord['cat1a'] + therecord['cat1b']
    t2 = therecord['cat2a'] + therecord['cat2b']

    return pd.Series([t1, t2])

And then,

df_test[['cat1tot', 'cat2tot']] = \
      df_test[['cat1a', 'cat1b', 'cat2a', 'cat2b']].apply(add_total,axis=1)

df_test

   cat1a  cat1b  cat2a  cat2b  id  cat1tot  cat2tot
0      3      2      4      3   1        5        7
1      7      5      9      6   2       12       15

This works, because apply will special case the Series return type, and assume you want the result as a dataframe slice.

cs95
  • 379,657
  • 97
  • 704
  • 746
1

how about

df_test['cat1tot'], df_test['cat2tot'] =\
   df_test[['cat1a', 'cat1b', 'cat2a', 'cat2b']].apply(add_total,axis=1)
mortysporty
  • 2,749
  • 6
  • 28
  • 51