
I want to combine two int columns to create a new dot-separated str column. I have one way that works, but I'd like to know if there is a faster way. I've also tried a suggestion I found in another answer on SO, but it produces an error.

This works:

df3 = pd.DataFrame({'job_number': [3913291, 3887250, 3913041],
                   'task_number': [38544, 0, 1]})
df3['filename'] = df3['job_number'].astype(str) + '.' + df3['task_number'].astype(str)

0    3913291.38544
1    3887250.0    
2    3913041.1

This answer to a similar question suggests a "numpy" way, using .values.astype(str), but I haven't gotten it to work yet. Here I run it without including the dot separator:

df3['job_number'].values.astype(int).astype(str) + df3['task_number'].astype(int).astype(str)

0    391329138544
1    38872500    
2    39130411 

But when I include the dot separator I get an error:

df3['job_number'].values.astype(int).astype(str) + '.' + df3['task_number'].astype(int).astype(str)

TypeError: ufunc 'add' did not contain a loop with signature matching types dtype('<U11') dtype('<U11') dtype('<U11')
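For context on that error: numpy's `add` ufunc has no loop for fixed-width unicode dtypes, so `unicode_array + '.'` (array plus a bare Python string) fails. A minimal sketch of one workaround using `np.char.add`, assuming the same `df3` as above:

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame({'job_number': [3913291, 3887250, 3913041],
                    'task_number': [38544, 0, 1]})

# `+` on a '<U11' array and a plain str raises the TypeError above;
# np.char.add performs element-wise string concatenation instead:
jobs = df3['job_number'].values.astype(str)
tasks = df3['task_number'].values.astype(str)
df3['filename'] = np.char.add(np.char.add(jobs, '.'), tasks)
```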

The result I want is:

0    3913291.38544
1    3887250.0    
2    3913041.1
Karl Baker

3 Answers

For a comparison of these methods with the other available methods, refer to @jezrael's answer.

Method 1

Add a dummy column containing `.`, use it in the concatenation, and drop it afterwards:

%%timeit
df3['dummy'] ='.'
res = df3['job_number'].values.astype(str) + df3['dummy'] + df3['task_number'].values.astype(str)
df3.drop(columns=['dummy'], inplace=True)

1.31 ms ± 41.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

As an extension of method 1: if you exclude the time spent creating and dropping the dummy column, this is the best you get -

%%timeit
df3['job_number'].values.astype(str) + df3['dummy'] + df3['task_number'].values.astype(str)

286 µs ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Method 2

Use `apply`:

%timeit df3.T.apply(lambda x: str(x[0]) + '.' + str(x[1]))

883 µs ± 22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
meW

You can use list comprehension:

df3["filename"] = ['.'.join(i) for i in 
                          zip(df3["job_number"].map(str),df3["task_number"].map(str))]

If you use Python 3.6+, the fastest solution uses f-strings:

df3["filename2"] = [f'{i}.{j}' for i,j in zip(df3["job_number"],df3["task_number"])]

Performance in 30k rows:

df3 = pd.DataFrame({'job_number': [3913291, 3887250, 3913041],
                   'task_number': [38544, 0, 1]})
df3 = pd.concat([df3] * 10000, ignore_index=True)


In [64]: %%timeit
    ...: df3["filename2"] = [f'{i}.{j}' for i,j in zip(df3["job_number"],df3["task_number"])]
    ...: 
20.5 ms ± 226 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [65]: %%timeit
    ...: df3["filename3"] = ['.'.join(i) for i in zip(df3["job_number"].map(str),df3["task_number"].map(str))]
    ...: 
30.9 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [66]: %%timeit
    ...: df3["filename4"] = df3.T.apply(lambda x: str(x[0]) + '.' + str(x[1]))
    ...: 
1.7 s ± 31.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [67]: %%timeit
    ...: df3['dummy'] ='.'
    ...: res = df3['job_number'].values.astype(str) + df3['dummy'] + df3['task_number'].values.astype(str)
    ...: df3.drop(columns=['dummy'], inplace=True)
    ...: 
73.6 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

But the original solution is also very fast:

In [73]: %%timeit
    ...: df3['filename'] = df3['job_number'].astype(str) + '.' + df3['task_number'].astype(str)
48.3 ms ± 872 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

With a small modification - using `map` instead of `astype`:

In [76]: %%timeit
    ...: df3['filename'] = df3['job_number'].map(str) + '.' + df3['task_number'].map(str)
    ...: 
26 ms ± 676 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
jezrael
  • So finally it all boils down to `map` haha (+1). Thanks for a great comparison – meW Mar 03 '19 at 08:51
  • @meW - yes, but seems `f-string`s are faster ;) – jezrael Mar 03 '19 at 08:55
  • @jezrael Cool, thanks! I've been busy trying to run time tests on all the permutations. Not sure if I captured them all (I'm sure I didn't), but I'll post the list to my original question. – Karl Baker Mar 03 '19 at 09:37
  • @jezrael, when I try `map` I get this error: `TypeError: Series cannot perform the operation +`. I'm running numpy 1.15.4, pandas 0.23.4, python 3.7.1, jupyter-client 5.2.3. – Karl Baker Mar 03 '19 at 10:15

Methods in order of %%timeit results

I timed all the suggested methods and a few more on two DataFrames. Here are the timed results for the suggested methods (thank you @meW and @jezrael). If I missed any or you have another, let me know and I'll add it.

Two timings are shown for each method: first for processing the 3 rows in the example df and then for processing 57K rows in another df. Timings may vary on another system. Solutions that include TEST['dot'] in the concatenation string require this column in the df: add it with TEST['dot'] = '.'.
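As a sketch of the benchmark setup (the actual 57K-row df isn't reproduced in this answer, so the repetition factor below is only illustrative):

```python
import pandas as pd

# 3-row frame from the question; a larger TEST frame can be built by
# repeating it (the real 57K-row df came from other data):
TEST = pd.DataFrame({'job_number': [3913291, 3887250, 3913041],
                     'task_number': [38544, 0, 1]})
TEST = pd.concat([TEST] * 19000, ignore_index=True)  # 57,000 rows
TEST['dot'] = '.'  # required by the solutions that concatenate TEST['dot']
```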

Original method (still the fastest):

.astype(str), +, '.'

%%timeit
TEST['filename'] = TEST['job_number'].astype(str) + '.' + TEST['task_number'].astype(str)
# 553 µs ± 6.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 69.6 ms ± 876 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows

Proposed methods and a few permutations on them:

.astype(int).astype(str), +, '.'

%%timeit
TEST['filename'] = TEST['job_number'].astype(int).astype(str) + '.' + TEST['task_number'].astype(int).astype(str)
# 553 µs ± 6.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 70.2 ms ± 739 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows

.values.astype(int).astype(str), +, TEST['dot']

%%timeit
TEST['filename'] = TEST['job_number'].values.astype(int).astype(str) + TEST['dot'] + TEST['task_number'].values.astype(int).astype(str)
# 221 µs ± 5.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 82.3 ms ± 743 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows

.values.astype(str), +, TEST['dot']

%%timeit
TEST["filename"] = TEST['job_number'].values.astype(str) + TEST['dot'] + TEST['task_number'].values.astype(str)
# 221 µs ± 5.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 92.8 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows

'.'.join(), list comprehension, .values.astype(str)

%%timeit
TEST["filename"] = ['.'.join(i) for i in TEST[["job_number",'task_number']].values.astype(str)]
# 743 µs ± 19.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 147 ms ± 532 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) on 57K rows

f-string, list comprehension, .values.astype(str)

%%timeit
TEST["filename2"] = [f'{i}.{j}' for i,j in TEST[["job_number",'task_number']].values.astype(str)]
# 642 µs ± 27.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 167 ms ± 3.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) on 57K rows

'.'.join(), zip, list comprehension, .map(str)

%%timeit
TEST["filename"] = ['.'.join(i) for i in 
                          zip(TEST["job_number"].map(str), TEST["task_number"].map(str))]
# 512 µs ± 5.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 181 ms ± 4.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) on 57K rows

apply(lambda, str(x[2]), +, '.')

%%timeit
TEST['filename'] = TEST.T.apply(lambda x: str(x[2]) + '.' + str(x[10]))
# 735 µs ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) on 3 rows
# 2.69 s ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) on 57K rows

If you see a way to improve on any of these, please let me know and I'll add to the list!
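One more option that isn't in the list above (untimed here, just a sketch): pandas' built-in `Series.str.cat` with a separator does the same concatenation without a dummy column:

```python
import pandas as pd

TEST = pd.DataFrame({'job_number': [3913291, 3887250, 3913041],
                     'task_number': [38544, 0, 1]})

# str.cat joins two string Series element-wise with the given separator
TEST['filename'] = TEST['job_number'].astype(str).str.cat(
    TEST['task_number'].astype(str), sep='.')
```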

Karl Baker