The original question addresses a specific narrow use case. For those who need more generic answers here are some examples:
Creating a new column using data from other columns
Given the dataframe below:
import pandas as pd
import numpy as np
df = pd.DataFrame([['dog', 'hound', 5],
['cat', 'ragdoll', 1]],
columns=['animal', 'type', 'age'])
In[1]:
Out[1]:
animal type age
----------------------
0 dog hound 5
1 cat ragdoll 1
Below we are adding a new description
column as a concatenation of other columns by using the +
operation which is overridden for series. Fancy string formatting, f-strings etc won't work here since the +
applies to scalars and not 'primitive' values:
df['description'] = 'A ' + df.age.astype(str) + ' years old ' \
+ df.type + ' ' + df.animal
In [2]: df
Out[2]:
animal type age description
-------------------------------------------------
0 dog hound 5 A 5 years old hound dog
1 cat ragdoll 1 A 1 years old ragdoll cat
We get 1 years
for the cat (instead of 1 year
) which we will be fixing below using conditionals.
Modifying an existing column with conditionals
Here we are replacing the original animal
column with values from other columns, and using np.where
to set a conditional substring based on the value of age
:
# append 's' to 'age' if it's greater than 1
df.animal = df.animal + ", " + df.type + ", " + \
df.age.astype(str) + " year" + np.where(df.age > 1, 's', '')
In [3]: df
Out[3]:
animal type age
-------------------------------------
0 dog, hound, 5 years hound 5
1 cat, ragdoll, 1 year ragdoll 1
Modifying multiple columns with conditionals
A more flexible approach is to call .apply()
on an entire dataframe rather than on a single column:
def transform_row(r):
r.animal = 'wild ' + r.type
r.type = r.animal + ' creature'
r.age = "{} year{}".format(r.age, r.age > 1 and 's' or '')
return r
df.apply(transform_row, axis=1)
In[4]:
Out[4]:
animal type age
----------------------------------------
0 wild hound dog creature 5 years
1 wild ragdoll cat creature 1 year
In the code above the transform_row(r)
function takes a Series
object representing a given row (indicated by axis=1
, the default value of axis=0
will provide a Series
object for each column). This simplifies processing since you can access the actual 'primitive' values in the row using the column names and have visibility of other cells in the given row/column.