4

I have the following Pandas DataFrame, with city and arr columns:

city      arr  final_target
paris     11   paris_11
paris     12   paris_12
dallas    22   dallas
miami     15   miami
paris     16   paris_16

My goal is to fill the final_target column by concatenating the city name and the arr number when the city is paris, and with just the city name when it is not paris.

What is the most pythonic way to do this?
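For anyone who wants to reproduce the answers below, the input frame (before final_target is filled) can be built like this:

```python
import pandas as pd

# Input data from the question; final_target is the column to compute.
df = pd.DataFrame({
    "city": ["paris", "paris", "dallas", "miami", "paris"],
    "arr": [11, 12, 22, 15, 16],
})
```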

Alex Dana

4 Answers

5

What is the most pythonic way to do this?

It depends on the definition. If "pythonic" means the most preferable, most common, and fastest way, then the np.where solution here is the most pythonic one.


Use numpy.where; if you need a pandas-native approach, see the alternatives below. All of these solutions are vectorized, so they should be preferable to apply (which loops under the hood):

df['final_target'] = np.where(df['city'].eq('paris'), 
                              df['city'] + '_' + df['arr'].astype(str), 
                              df['city'])

Pandas alternatives:

df['final_target'] = df['city'].mask(df['city'].eq('paris'), 
                                     df['city'] + '_' + df['arr'].astype(str))

df['final_target'] = df['city'].where(df['city'].ne('paris'), 
                                      df['city'] + '_' + df['arr'].astype(str))
print(df)
     city  arr final_target
0   paris   11     paris_11
1   paris   12     paris_12
2  dallas   22       dallas
3   miami   15        miami
4   paris   16     paris_16

Performance:

# 50k rows
df = pd.concat([df] * 10000, ignore_index=True)

In [157]: %%timeit
     ...: df['final_target'] = np.where(df['city'].eq('paris'), 
     ...:                               df['city'] + '_' + df['arr'].astype(str), 
     ...:                               df['city'])
     ...:                               
48.6 ms ± 444 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [158]: %%timeit
     ...: df['city'] + (df['city'] == 'paris')*('_' + df['arr'].astype(str))
     ...: 
     ...: 
49.2 ms ± 1.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [159]: %%timeit
     ...: df['final_target'] = df['city']
     ...: df.loc[df['city'] == 'paris', 'final_target'] +=  '_' + df['arr'].astype(str)
     ...: 
63.8 ms ± 764 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [160]: %%timeit
     ...: df['final_target'] = df.apply(lambda x: x.city + '_' + str(x.arr) if x.city == 'paris' else x.city, axis = 1)
     ...: 
     ...: 
1.33 s ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
jezrael
3

A one-liner code does the trick:

df['final_target'] = df.apply(lambda x: x.city + '_' + str(x.arr) if x.city == 'paris' else x.city, axis = 1)
ql.user2511
  • This solution is not pythonic in my opinion, because vectorized, faster alternatives exist - check https://stackoverflow.com/a/54432584/2901002 – jezrael Sep 24 '21 at 10:44
  • Why is it not pythonic? – ql.user2511 Sep 24 '21 at 10:51
  • I think because it loops under the hood. I added a link with more explanation of why this `method` should be avoided – jezrael Sep 24 '21 at 11:00
  • It says that ```apply``` consumes a lot of memory, since the function is "applied" row by row. It might be slower than the other functions, but I don't think that this makes it "not pythonic"... – ql.user2511 Sep 24 '21 at 11:02
  • Yeah, it depends on what "more pythonic" means. If it means the most preferable, most common, and fastest way, then this is not pythonic. – jezrael Sep 24 '21 at 11:03
  • 1
    Yeah, I agree. It all lies in what "pythonic" truly means. Maybe if the user was looking for a "fast" way, then an alternative way would have been more appropriate. – ql.user2511 Sep 24 '21 at 11:05
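If the goal is a one-liner without apply's per-row overhead, a plain list comprehension over the column values is a common middle ground: it still loops in Python, but zipping the raw values avoids the per-row Series construction that makes DataFrame.apply slow. A sketch, assuming the same frame as in the question:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["paris", "paris", "dallas", "miami", "paris"],
    "arr": [11, 12, 22, 15, 16],
})

# Loop in Python, but zip over the column values instead of using apply,
# which builds a Series object for every row.
df["final_target"] = [
    f"{c}_{a}" if c == "paris" else c
    for c, a in zip(df["city"], df["arr"])
]
```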
3

Try these neat and short two lines with loc:

df['final_target'] = df['city']
df.loc[df['city'] == 'paris', 'final_target'] +=  '_' + df.loc[df['city'] == 'paris', 'arr'].astype(str)

This solution first assigns df['city'] to the final_target column, then appends the arr column, separated by an underscore, where the city column is paris.

IMO this is probably the most Pythonic and neat way here.


print(df)

     city  arr final_target
0   paris   11     paris_11
1   paris   12     paris_12
2  dallas   22       dallas
3   miami   15        miami
4   paris   16     paris_16
U13-Forward
0

Pretty self-explanatory: one line, and it looks pythonic.

df['city'] + (df['city'] == 'paris')*('_' + df['arr'].astype(str))
s = """city,arr,final_target
paris,11,paris_11
paris,12,paris_12
dallas,22,dallas
miami,15,miami
paris,16,paris_16"""
import pandas as pd
import io
df = pd.read_csv(io.StringIO(s)).sample(1000000, replace=True)
df

Speeds

%%timeit
df['city'] + (df['city'] == 'paris')*('_' + df['arr'].astype(str))
# 877 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['final_target'] = np.where(df['city'].eq('paris'), 
                              df['city'] + '_' + df['arr'].astype(str), 
                              df['city'])
# 874 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I am not sure why this example fails (update: it fails due to the sampling), but the memory error is still a mystery: Why memory error when using .loc in pandas with sampling instead of direct computing

%%timeit
df['final_target'] = df['city']
df.loc[df['city'] == 'paris', 'final_target'] +=  '_' + df['arr'].astype(str)
MemoryError: Unable to allocate 892. GiB for an array with shape (119671145392,) and data type int64
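The update above points at the sampling; a likely mechanism is that sample(..., replace=True) leaves duplicate index labels, and the += on .loc then aligns the two sides on those labels, multiplying the duplicates into the enormous intermediate the error reports. Resetting the index makes every label unique again, and the same two lines run; a sketch at a smaller sample size:

```python
import io
import pandas as pd

s = """city,arr,final_target
paris,11,paris_11
paris,12,paris_12
dallas,22,dallas
miami,15,miami
paris,16,paris_16"""

# sample(..., replace=True) duplicates index labels; += on .loc would
# align on those labels and blow up. reset_index makes labels unique.
df = pd.read_csv(io.StringIO(s)).sample(1000, replace=True, random_state=0)
df = df.reset_index(drop=True)

df["final_target"] = df["city"]
df.loc[df["city"] == "paris", "final_target"] += "_" + df["arr"].astype(str)
```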
eroot163pi