
I want to add two columns to a pandas DataFrame using a function that returns a tuple, like this:

import pandas as pd

data = pd.DataFrame({'a':[1,2,3,4,5,6],'b':['ssdfsdf','bbbbbb','cccccccccccc','ddd','eeeeee','ffffff']})

def givetup(string):
    
    result1 = string[0:3]
    # please imagine here a bunch of string functions concatenated.
    # including nlp methods with SpaCy 
    result2 = result1.upper()
    # the same here, imagine a bunch of steps to calculate result2 based on result 1
    
    return (result1,result2)

data['c'] = data['b'].apply(lambda x: givetup(x)[0])
data['d'] = data['b'].apply(lambda x: givetup(x)[1])

This is very inefficient (I am dealing with millions of rows), since I call the same function twice and therefore do every calculation twice. Since result2 depends on result1, I would rather not split givetup into two functions. How can I assign result1 and result2 to the new columns c and d in one go, with only one call to the function per row? What is the most efficient way to do it?

Please bear in mind that result1 and result2 are very time-consuming string calculations.

EDIT 1: I am aware of this question: Apply pandas function to column to create multiple new columns?

That is, applying vectorized functions. In my particular case that is highly undesirable or perhaps even impossible. Imagine that result1 and result2 are calculated based on language models and I need the plain text.

JFerro
  • *result2 depends on result 1*: is it possible to write two (vectorized) functions, one to get `result1` and one to get `result2` separately? Then you can do `data['c'] = func1(data['b']); data['d'] = func2(data['c'])`. – Quang Hoang Mar 26 '21 at 19:10
  • To follow up on what @QuangHoang said. I vectorized like this `data.assign(c=lambda d: d.b.str[0:3], d=lambda d: d.c.str.upper())` – piRSquared Mar 26 '21 at 19:15

3 Answers


You can try a list comprehension here:

data[['c','d']] = [givetup(a) for a in data['b']]

Output:

   a             b    c    d
0  1       ssdfsdf  ssd  SSD
1  2        bbbbbb  bbb  BBB
2  3  cccccccccccc  ccc  CCC
3  4           ddd  ddd  DDD
4  5        eeeeee  eee  EEE
5  6        ffffff  fff  FFF
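
As a quick check that this really calls the function only once per row, here is a minimal sketch. The `calls` counter is a hypothetical addition for illustration only; it is not part of the original `givetup`.

```python
import pandas as pd

data = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6],
    'b': ['ssdfsdf', 'bbbbbb', 'cccccccccccc', 'ddd', 'eeeeee', 'ffffff'],
})

calls = 0  # hypothetical counter, only here to make the number of calls visible


def givetup(string):
    global calls
    calls += 1
    result1 = string[0:3]
    result2 = result1.upper()
    return (result1, result2)


# One pass over column b: each tuple becomes one row of the new columns c and d.
data[['c', 'd']] = [givetup(a) for a in data['b']]

print(calls == len(data))  # True: one call per row
print(data)
```

Note that the values are assigned by position, so the list has to be in the same order as the rows of `data`.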
Quang Hoang

zip/map

data['c'], data['d'] = zip(*map(givetup, data['b']))

data

   a             b    c    d
0  1       ssdfsdf  ssd  SSD
1  2        bbbbbb  bbb  BBB
2  3  cccccccccccc  ccc  CCC
3  4           ddd  ddd  DDD
4  5        eeeeee  eee  EEE
5  6        ffffff  fff  FFF
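
In case the `zip(*...)` idiom is unfamiliar: unpacking a sequence of pairs into `zip` transposes it, turning N two-item tuples into two N-item tuples. A small sketch with made-up rows (the names `rows`, `first` and `second` are only for illustration):

```python
import pandas as pd

# Three (result1, result2) pairs, as givetup would return them row by row.
rows = [('ssd', 'SSD'), ('bbb', 'BBB'), ('ccc', 'CCC')]

# zip(*rows) is zip(rows[0], rows[1], rows[2]): it regroups by position.
first, second = zip(*rows)
print(first)   # ('ssd', 'bbb', 'ccc')
print(second)  # ('SSD', 'BBB', 'CCC')

# Each transposed tuple then becomes one column.
frame = pd.DataFrame({'b': ['ssdfsdf', 'bbbbbb', 'cccccccccccc']})
frame['c'], frame['d'] = first, second
print(frame)
```

Because `map` is lazy, the function still runs exactly once per row; `zip` just has to consume all results before the two columns can be assigned.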

Series.str and assign

This is specific to the example operations in givetup. But if it is possible to disentangle the two results into vectorized steps, then it is likely worth it.

The keyword arguments of the assign method can take callables that reference columns created by an argument just prior (NEAT).

data.assign(c=lambda d: d.b.str[0:3], d=lambda d: d.c.str.upper())

   a             b    c    d
0  1       ssdfsdf  ssd  SSD
1  2        bbbbbb  bbb  BBB
2  3  cccccccccccc  ccc  CCC
3  4           ddd  ddd  DDD
4  5        eeeeee  eee  EEE
5  6        ffffff  fff  FFF

Timings

data = pd.concat([data] * 10_000, ignore_index=True)

%timeit data['c'], data['d'] = zip(*map(givetup, data['b']))
%timeit data[['c','d']] = [givetup(a) for a in data['b']]
%timeit data.assign(c=lambda d: d.b.str[0:3], d=lambda d: d.c.str.upper())

69.7 ms ± 865 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
137 ms ± 937 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
34.6 ms ± 235 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
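
`%timeit` is an IPython magic. Outside IPython, a rough equivalent can be sketched with the standard library's `timeit` module. The wrapper function names and the repeat counts below are my own choices, and the numbers will differ from machine to machine, so no expected output is shown.

```python
import timeit

import pandas as pd


def givetup(string):
    result1 = string[0:3]
    result2 = result1.upper()
    return (result1, result2)


base = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6],
    'b': ['ssdfsdf', 'bbbbbb', 'cccccccccccc', 'ddd', 'eeeeee', 'ffffff'],
})
data = pd.concat([base] * 10_000, ignore_index=True)


def with_zip_map():
    data['c'], data['d'] = zip(*map(givetup, data['b']))


def with_list_comprehension():
    data[['c', 'd']] = [givetup(a) for a in data['b']]


def with_assign():
    return data.assign(c=lambda d: d.b.str[0:3], d=lambda d: d.c.str.upper())


NUMBER = 3  # calls per measurement
timings = {}
for func in (with_zip_map, with_list_comprehension, with_assign):
    # Take the best of 3 measurements and report seconds per single call.
    best = min(timeit.repeat(func, number=NUMBER, repeat=3))
    timings[func.__name__] = best / NUMBER

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.1f} ms per call")
```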
piRSquared
  • Seems like the second one is the way to go. The third one is not possible in my particular case. – JFerro Mar 26 '21 at 19:37
  • The first and the second solutions here give: `/opt/conda/lib/python3.8/site-packages/numpy/core/_asarray.py:102: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray. return array(a, dtype, copy=False, order=order)`. What does it mean? – JFerro Mar 26 '21 at 19:45
  • That means your tuples are not all of length two, and the toy example was not representative of your actual situation. – piRSquared Mar 26 '21 at 19:58
  • You can do this to ensure you always get exactly two: `data['c'], data['d'] = zip(*[x[:2] for x in map(givetup, data['b'])])` – piRSquared Mar 26 '21 at 19:59
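
To make the ragged-tuple situation from the comments concrete, here is a hedged sketch. `givetup_ragged` and `fit_to_two` are hypothetical names: the first stands in for a function whose tuples vary in length, the second extends the `x[:2]` idea by also padding tuples that are too short (an assumption about what you would want in that case).

```python
import pandas as pd


def givetup_ragged(string):
    # Hypothetical stand-in: returns a tuple whose length depends on the input.
    return tuple(string.split('-'))


def fit_to_two(t, fill=None):
    # Truncate long tuples and pad short ones so every row yields exactly two items.
    return (tuple(t) + (fill, fill))[:2]


data = pd.DataFrame({'b': ['aa-bb', 'cc-dd-ee', 'ff']})

results = [givetup_ragged(s) for s in data['b']]

# More than one distinct length means the results are ragged.
lengths = {len(r) for r in results}
print(lengths)

data['c'], data['d'] = zip(*(fit_to_two(r) for r in results))
print(data)
```

Checking `lengths` first tells you whether silently dropping or padding items is acceptable, or whether the function itself needs fixing.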

Another way to do this is to use apply with a function that returns a Series:

import pandas as pd

data=pd.DataFrame({'a':[1,2,3,4,5,6],'b':['ssdfsdf','bbbbbb','cccccccccccc','ddd','eeeeee','ffffff']})

def givetup(column):
    
    column1 = column[0:3]
    column2 = column[0:3].upper()
    
    return pd.Series([column1, column2])

data[['c','d']] = data['b'].apply(lambda x: givetup(x))
Bricam
  • This approach is likely to be very inefficient. First of all, you are creating a `pd.Series` for every row. Then you are asking Pandas to align those new columns for every row. Also, `column1 = column[0:3]` already slices the string. You slice it again in `column2 = column[0:3].upper()`, which is wasteful to do for every row. The timings for this are on the order of 1000 times slower. – piRSquared Mar 26 '21 at 19:28
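
If you want named columns without building a `pd.Series` per row, one possible sketch is to keep the function returning a plain tuple and construct a single DataFrame from all the tuples at once. This assumes every call returns exactly two items; `pairs` is just an illustrative name.

```python
import pandas as pd

data = pd.DataFrame({
    'a': [1, 2, 3, 4, 5, 6],
    'b': ['ssdfsdf', 'bbbbbb', 'cccccccccccc', 'ddd', 'eeeeee', 'ffffff'],
})


def givetup(string):
    result1 = string[0:3]
    result2 = result1.upper()  # reuses result1 instead of slicing again
    return (result1, result2)


# Build the two new columns in one go; reusing data.index keeps the rows aligned.
pairs = pd.DataFrame(
    [givetup(s) for s in data['b']],
    columns=['c', 'd'],
    index=data.index,
)

data = data.join(pairs)
print(data)
```

Here the only pandas objects created are one DataFrame for all rows and the joined result, rather than one Series per row.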