Create new column with incremental values efficiently

Question

I am creating a column with incremental values and then appending a string at the start of the column. When used on large data this is very slow. Please suggest a faster and efficient way for the same.

df['New_Column'] = np.arange(df[0])+1
df['New_Column'] = 'str' + df['New_Column'].astype(str)

Input

id  Field   Value
1     A       1
2     B       0     
3     D       1

Output

id  Field   Value   New_Column
1     A       1     str_1
2     B       0     str_2
3     D       1     str_3

Related / possible dup: [Combine two columns of text in dataframe in pandas/python](https://stackoverflow.com/questions/19377969/combine-two-columns-of-text-in-dataframe-in-pandas-python) — jpp, Mar 24 '18 at 20:55

piRSquared · Answer 1 · 2018-03-26T03:49:56.213

I'll add two more in the mix

Numpy

from numpy.core.defchararray import add

df.assign(new=add('str_', np.arange(1, len(df) + 1).astype(str)))

   id Field  Value    new
0   1     A      1  str_1
1   2     B      0  str_2
2   3     D      1  str_3

`f-string` in comprehension

Python 3.6+

df.assign(new=[f'str_{i}' for i in range(1, len(df) + 1)])

   id Field  Value    new
0   1     A      1  str_1
1   2     B      0  str_2
2   3     D      1  str_3

Time Test

Conclusions

Comprehension wins the day with performance relative to simplicity. Mind you, this was cᴏʟᴅsᴘᴇᴇᴅ's proposed method. I appreciate the upvotes (thank you) but let's give credit where it's due.

Cythonizing the comprehension didn't seem to help. Nor did f-strings.
Divakar's numexp comes out on top for performance over larger data.

Functions

%load_ext Cython

%%cython
def gen_list(l, h):
    return ['str_%s' % i for i in range(l, h)]

pir1 = lambda d: d.assign(new=[f'str_{i}' for i in range(1, len(d) + 1)])
pir2 = lambda d: d.assign(new=add('str_', np.arange(1, len(d) + 1).astype(str)))
cld1 = lambda d: d.assign(new=['str_%s' % i for i in range(1, len(d) + 1)])
cld2 = lambda d: d.assign(new=gen_list(1, len(d) + 1))
jez1 = lambda d: d.assign(new='str_' + pd.Series(np.arange(1, len(d) + 1), d.index).astype(str))
div1 = lambda d: d.assign(new=create_inc_pattern(prefix_str='str_', start=1, stop=len(d) + 1))
div2 = lambda d: d.assign(new=create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=len(d) + 1))

Testing

res = pd.DataFrame(
    np.nan, [10, 30, 100, 300, 1000, 3000, 10000, 30000],
    'pir1 pir2 cld1 cld2 jez1 div1 div2'.split()
)

for i in res.index:
    d = pd.concat([df] * i)
    for j in res.columns:
        stmt = f'{j}(d)'
        setp = f'from __main__ import {j}, d'
        res.at[i, j] = timeit(stmt, setp, number=200)

Results

res.plot(loglog=True)

res.div(res.min(1), 0)

           pir1      pir2      cld1      cld2       jez1      div1      div2
10     1.243998  1.137877  1.006501  1.000000   1.798684  1.277133  1.427025
30     1.009771  1.144892  1.012283  1.000000   2.144972  1.210803  1.283230
100    1.090170  1.567300  1.039085  1.000000   3.134154  1.281968  1.356706
300    1.061804  2.260091  1.072633  1.000000   4.792343  1.051886  1.305122
1000   1.135483  3.401408  1.120250  1.033484   7.678876  1.077430  1.000000
3000   1.310274  5.179131  1.359795  1.362273  13.006764  1.317411  1.000000
10000  2.110001  7.861251  1.942805  1.696498  17.905551  1.974627  1.000000
30000  2.188024  8.236724  2.100529  1.872661  18.416222  1.875299  1.000000

More Functions

def create_inc_pattern(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(N+1))) # width of numeral part in string
    dl = len(prefix_str)+W # datatype length
    dt = np.uint8 # int datatype for string to-from conversion 

    padv = np.full(W,48,dtype=np.uint8)
    a0 = np.r_[np.fromstring(prefix_str,dtype='uint8'), padv]

    r = np.arange(start, stop)

    addn = (r[:,None] // 10**np.arange(W-1,-1,-1))%10
    a1 = np.repeat(a0[None],N,axis=0)
    a1[:,len(prefix_str):] += addn.astype(dt)
    a1.shape = (-1)

    a2 = np.zeros((len(a1),4),dtype=dt)
    a2[:,0] = a1
    return np.frombuffer(a2.ravel(), dtype='U'+str(dl))

import numexpr as ne

def create_inc_pattern_numexpr(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(N+1))) # width of numeral part in string
    dl = len(prefix_str)+W # datatype length
    dt = np.uint8 # int datatype for string to-from conversion 

    padv = np.full(W,48,dtype=np.uint8)
    a0 = np.r_[np.fromstring(prefix_str,dtype='uint8'), padv]

    r = np.arange(start, stop)

    r2D = r[:,None]
    s = 10**np.arange(W-1,-1,-1)
    addn = ne.evaluate('(r2D/s)%10')
    a1 = np.repeat(a0[None],N,axis=0)
    a1[:,len(prefix_str):] += addn.astype(dt)
    a1.shape = (-1)

    a2 = np.zeros((len(a1),4),dtype=dt)
    a2[:,0] = a1
    return np.frombuffer(a2.ravel(), dtype='U'+str(dl))

@Divakar updated. Looking to see if `astype(str)` costs too much — piRSquared, Mar 24 '18 at 23:49
@piRSquared Yeah `.astype(str)` looks like a good fix to it without too much of overhead. So, go ahead with it I would say at this point. — Divakar, Mar 24 '18 at 23:50
From what i just ran, you lose all advantage. Cython comes back on top. I suspect there must be another way. — piRSquared, Mar 24 '18 at 23:52
`df = df(Type=["str_" + str(i+1) for i in xrange(len(df))]).reset_index(drop=True)` This seems to be faster than rest. Can you test it? — Nani, Mar 25 '18 at 09:46
@piRSquared - I was really surprised why my solution was worst, but you miss I use `.map` instead `.astype`. Can you change timings? — jezrael, Mar 29 '18 at 12:23

cs95 · Answer 2 · 2018-03-24T22:01:42.077

16

When all else fails, use a list comprehension:

df['NewColumn'] = ['str_%s' %i for i in range(1, len(df) + 1)]

Further speedups are possible if you cythonize your function:

%load_ext Cython

%%cython
def gen_list(l, h):
    return ['str_%s' %i for i in range(l, h)]

Note, this code is run on Python3.6.0 (IPython6.2.1). Solution improved thanks to @hpaulj in the comments.

# @jezrael's fastest solution

%%timeit
df['NewColumn'] = np.arange(len(df['a'])) + 1
df['NewColumn'] = 'str_' + df['New_Column'].map(str)

547 ms ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# in this post - no cython

%timeit df['NewColumn'] = ['str_%s'%i for i in range(n)]
409 ms ± 9.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# cythonized list comp 

%timeit df['NewColumn'] = gen_list(1, len(df) + 1)
370 ms ± 9.23 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

edited Mar 24 '18 at 22:01

answered Mar 24 '18 at 21:11

cs95

379,657
97
704
746

hmmm, interesting, for me list comprehension is slowier, maybe because windows? Not sure – jezrael Mar 24 '18 at 21:24
@jezrael Very interesting, I guess it depends on a lot of factors, most likely the OS as I am running this on a Unix machine. – cs95 Mar 24 '18 at 21:25
Yes, I already try it but dont add to solution, because slowier ;) – jezrael Mar 24 '18 at 21:26
refresh, I change comment :) – jezrael Mar 24 '18 at 21:27
@jezrael I will leave this answer here because I think it may be useful for OP ;) – cs95 Mar 24 '18 at 21:28
no problem, only I was interested what should be difference. – jezrael Mar 24 '18 at 21:29
Something does seem a bit fishy in @jezrael timing though as the standard deviation is greater than the mean – DJK Mar 24 '18 at 21:30
@DJK - hmmm, interested. Not sure why. – jezrael Mar 24 '18 at 21:31
1

@DJK - I run timings again and better times and better std – jezrael Mar 24 '18 at 21:33
1

@DJK Thanks for pointing that out. I certainly did not notice that :/ – cs95 Mar 24 '18 at 21:34
In my tests, `['str_%s'%i for i in range(n)]` is fastest. – hpaulj Mar 24 '18 at 21:57
@hpaulj Thanks for that tidbit! Is it alright if I update my answer? Edit: I just did, let me know if you want to post and I'll roll it back. – cs95 Mar 24 '18 at 21:58

Divakar · Answer 3 · 2018-03-25T10:21:31.143

Proposed approach

After tinkering quite a bit with the string and numeric dtypes and leveraging the easy interoperability between them, here's something that I ended up with to get zeros padded strings, as NumPy does well and allows for vectorized operations that way -

def create_inc_pattern(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string

    padv = np.full(W,48,dtype=np.uint8)
    a0 = np.r_[np.fromstring(prefix_str,dtype='uint8'), padv]
    a1 = np.repeat(a0[None],N,axis=0)

    r = np.arange(start, stop)
    addn = (r[:,None] // 10**np.arange(W-1,-1,-1))%10
    a1[:,len(prefix_str):] += addn.astype(a1.dtype)
    return a1.view('S'+str(a1.shape[1])).ravel()

Brining in numexpr for faster broadcasting + modulus operation -

import numexpr as ne

def create_inc_pattern_numexpr(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string

    padv = np.full(W,48,dtype=np.uint8)
    a0 = np.r_[np.fromstring(prefix_str,dtype='uint8'), padv]
    a1 = np.repeat(a0[None],N,axis=0)

    r = np.arange(start, stop)
    r2D = r[:,None]
    s = 10**np.arange(W-1,-1,-1)
    addn = ne.evaluate('(r2D/s)%10')
    a1[:,len(prefix_str):] += addn.astype(a1.dtype)
    return a1.view('S'+str(a1.shape[1])).ravel()

So, to use as the new column :

df['New_Column'] = create_inc_pattern(prefix_str='str_', start=1, stop=len(df)+1)

Sample runs -

In [334]: create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=14)
Out[334]: 
array(['str_01', 'str_02', 'str_03', 'str_04', 'str_05', 'str_06',
       'str_07', 'str_08', 'str_09', 'str_10', 'str_11', 'str_12', 'str_13'], 
      dtype='|S6')

In [338]: create_inc_pattern(prefix_str='str_', start=1, stop=124)
Out[338]: 
array(['str_001', 'str_002', 'str_003', 'str_004', 'str_005', 'str_006',
       'str_007', 'str_008', 'str_009', 'str_010', 'str_011', 'str_012',..
       'str_115', 'str_116', 'str_117', 'str_118', 'str_119', 'str_120',
       'str_121', 'str_122', 'str_123'], 
      dtype='|S7')

Explanation

Basic idea and explanation with step-by-step sample run

The basic idea is creating the ASCII equivalent numeric array, which could be viewed or converted by dtype conversion into a string one. To be more specific, we would create uint8 type numerals. Thus, each string would be represented by a 1D array of numerals. For the list of strings that would translate to a 2D array of numerals with each row (1D array) representing a single string.

1) Inputs :

In [22]: prefix_str='str_'
    ...: start=15
    ...: stop=24

2) Parameters :

In [23]: N = stop - start # count of numbers
    ...: W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string

In [24]: N,W
Out[24]: (9, 2)

3) Create 1D array of numerals representing the starting string :

In [25]: padv = np.full(W,48,dtype=np.uint8)
    ...: a0 = np.r_[np.fromstring(prefix_str,dtype='uint8'), padv]

In [27]: a0
Out[27]: array([115, 116, 114,  95,  48,  48], dtype=uint8)

4) Extend to cover range of strings as 2D array :

In [33]: a1 = np.repeat(a0[None],N,axis=0)
    ...: r = np.arange(start, stop)
    ...: addn = (r[:,None] // 10**np.arange(W-1,-1,-1))%10
    ...: a1[:,len(prefix_str):] += addn.astype(a1.dtype)

In [34]: a1
Out[34]: 
array([[115, 116, 114,  95,  49,  53],
       [115, 116, 114,  95,  49,  54],
       [115, 116, 114,  95,  49,  55],
       [115, 116, 114,  95,  49,  56],
       [115, 116, 114,  95,  49,  57],
       [115, 116, 114,  95,  50,  48],
       [115, 116, 114,  95,  50,  49],
       [115, 116, 114,  95,  50,  50],
       [115, 116, 114,  95,  50,  51]], dtype=uint8)

5) Thus, each row represents ascii equivalent of a string each off the desired output. Let's get it with the final step :

In [35]: a1.view('S'+str(a1.shape[1])).ravel()
Out[35]: 
array(['str_15', 'str_16', 'str_17', 'str_18', 'str_19', 'str_20',
       'str_21', 'str_22', 'str_23'], 
      dtype='|S6')

Timings

Here's a quick timings test against the list comprehension version that seems to be working the best looking at the timings from other posts -

In [339]: N = 10000

In [340]: %timeit ['str_%s'%i for i in range(N)]
1000 loops, best of 3: 1.12 ms per loop

In [341]: %timeit create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=N)
1000 loops, best of 3: 490 µs per loop

In [342]: N = 100000

In [343]: %timeit ['str_%s'%i for i in range(N)]
100 loops, best of 3: 14 ms per loop

In [344]: %timeit create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=N)
100 loops, best of 3: 4 ms per loop

Python-3 codes

On Python-3, to get the string dtype array, we were needed to pad with few more zeros on the intermediate int dtype array. So, the equivalent without and with numexpr versions for Python-3 ended up becoming something along these lines -

Method #1 (No numexpr) :

def create_inc_pattern(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string
    dl = len(prefix_str)+W # datatype length
    dt = np.uint8 # int datatype for string to-from conversion 

    padv = np.full(W,48,dtype=np.uint8)
    a0 = np.r_[np.fromstring(prefix_str,dtype='uint8'), padv]

    r = np.arange(start, stop)

    addn = (r[:,None] // 10**np.arange(W-1,-1,-1))%10
    a1 = np.repeat(a0[None],N,axis=0)
    a1[:,len(prefix_str):] += addn.astype(dt)
    a1.shape = (-1)

    a2 = np.zeros((len(a1),4),dtype=dt)
    a2[:,0] = a1
    return np.frombuffer(a2.ravel(), dtype='U'+str(dl))

Method #2 (With numexpr) :

import numexpr as ne

def create_inc_pattern_numexpr(prefix_str, start, stop):
    N = stop - start # count of numbers
    W = int(np.ceil(np.log10(stop+1))) # width of numeral part in string
    dl = len(prefix_str)+W # datatype length
    dt = np.uint8 # int datatype for string to-from conversion 

    padv = np.full(W,48,dtype=np.uint8)
    a0 = np.r_[np.fromstring(prefix_str,dtype='uint8'), padv]

    r = np.arange(start, stop)

    r2D = r[:,None]
    s = 10**np.arange(W-1,-1,-1)
    addn = ne.evaluate('(r2D/s)%10')
    a1 = np.repeat(a0[None],N,axis=0)
    a1[:,len(prefix_str):] += addn.astype(dt)
    a1.shape = (-1)

    a2 = np.zeros((len(a1),4),dtype=dt)
    a2[:,0] = a1
    return np.frombuffer(a2.ravel(), dtype='U'+str(dl))

Timings -

In [8]: N = 100000

In [9]: %timeit ['str_%s'%i for i in range(N)]
100 loops, best of 3: 18.5 ms per loop

In [10]: %timeit create_inc_pattern_numexpr(prefix_str='str_', start=1, stop=N)
100 loops, best of 3: 6.06 ms per loop

This deserves a bounty. If it's not too much trouble, can you please add comments so that lesser mortals can understand what is happening? ;) — cs95, Mar 24 '18 at 23:28
I'm getting byte strings. I have to encode. So now I have to actually understand your code first (-: Still working on it — piRSquared, Mar 24 '18 at 23:37
@piRSquared Ah, must be that Python3 thing. Damn! I am on Python-2. — Divakar, Mar 24 '18 at 23:37
a simple `astype(str)` fixes it but I'm sure that's clumsy. Running without that for now. — piRSquared, Mar 24 '18 at 23:40
I would recommend switching to python3, because there are a lot of performance improvements (including to the list comp as well, I would bet) — cs95, Mar 24 '18 at 23:56

jezrael · Answer 4 · 2018-03-26T05:18:56.870

4

One possible solution is with convert values to strings by map:

df['New_Column'] = np.arange(len(df['a']))+1
df['New_Column'] = 'str_' + df['New_Column'].map(str)

edited Mar 26 '18 at 05:18

answered Mar 24 '18 at 20:33

jezrael

822,522
95
1,334
1,252

Create new column with incremental values efficiently

Input

Output

4 Answers4

Numpy

`f-string` in comprehension

Time Test

Conclusions

Functions

Testing

Results

More Functions

Proposed approach

Explanation

Timings

Python-3 codes

Linked

Create new column with incremental values efficiently

Input

Output

4 Answers4

Numpy

f-string in comprehension

Time Test

Conclusions

Functions

Testing

Results

More Functions

Proposed approach

Explanation

Timings

Python-3 codes

Linked

`f-string` in comprehension