
I'm trying to reproduce my Stata code in Python, and I was pointed in the direction of Pandas. I am, however, having a hard time wrapping my head around how to process the data.

Let's say I want to iterate over all values in the column 'ID'. If that ID matches a specific number, then I want to change the two corresponding values, FirstName and LastName.

In Stata it looks like this:

replace FirstName = "Matt" if ID==103
replace LastName =  "Jones" if ID==103

So this replaces all values of FirstName with "Matt" wherever ID == 103 (and likewise LastName with "Jones").

In pandas, I'm trying something like this:

import pandas as pd

df = pd.read_csv("test.csv")
for i in df['ID']:
    if i == 103:
        ...

Not sure where to go from here. Any ideas?

Parseltongue

7 Answers


One option is to use pandas' indexing features to logically evaluate the places where your condition holds and overwrite the data there.

Assuming you can load your data directly into pandas with pandas.read_csv, the following code might be helpful for you:

import pandas
df = pandas.read_csv("test.csv")
df.loc[df.ID == 103, 'FirstName'] = "Matt"
df.loc[df.ID == 103, 'LastName'] = "Jones"

As mentioned in the comments, you can also do the assignment to both columns in one shot:

df.loc[df.ID == 103, ['FirstName', 'LastName']] = 'Matt', 'Jones'
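
If you need to filter on more than one column, the boolean Series can be combined with & (parenthesize each condition). A minimal sketch, assuming a hypothetical name column in addition to ID:

df.loc[(df['ID'] == 103) & (df['name'] == 'Matthew'), ['FirstName', 'LastName']] = 'Matt', 'Jones'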

Note that you'll need pandas version 0.11 or newer to make use of loc for overwrite assignment operations. On older versions (e.g. 0.8) you would have had to fall back on approaches like chained assignment, which is why it is still worth recognizing, even though it is unreliable and should be avoided in modern pandas.


Another pattern you will encounter is what is called chained assignment. Its behavior is less stable: the assignment may silently act on a copy rather than on the original DataFrame, so it is explicitly discouraged in the docs. It is still useful to be able to recognize it:

import pandas
df = pandas.read_csv("test.csv")
df['FirstName'][df.ID == 103] = "Matt"
df['LastName'][df.ID == 103] = "Jones"
ely
  • how about adding also this flavor: `df.loc[df.ID == 103, ['FirstName', 'LastName']] = 'Matt', 'Jones'` – Zeugma Oct 07 '13 at 13:54
  • -1 "Another way to do it is to use what is called chained assignment." No. Emphatically, no. It's *only* useful to know that chained assignment isn't reliable. It's not that it's a reliable, non-optimal solution, [the situation is much worse](http://pandas.pydata.org/pandas-docs/dev/indexing.html#indexing-view-versus-copy). You've even acknowledged this [elsewhere on Stack Overflow](http://stackoverflow.com/a/19125721/564538). Please try to avoid giving the illusion that chained assignment is a viable option. The first two methods you gave were enough, and are the preferred way to do this. – Phillip Cloud Oct 07 '13 at 15:55
  • The internet is serious business. At any rate, EMS, I appreciated knowing the option exists. – Parseltongue Oct 09 '13 at 06:44
  • One issue you might run into is that the csv has periods/dots in the column names and assignments get messed up. You can fix the columns using something like this: `cols = df.columns; cols = cols.map(lambda x: x.replace('.', '_') if isinstance(x, str) else x); df.columns = cols` – ski_squaw Nov 09 '16 at 19:29
  • How do I use the above example for multiple conditions like df.ID = 103 and df.name = "Mathew" ? – Adarsh Trivedi Jul 18 '18 at 08:44
  • @AdarshTrivedi are you trying to filter the dataframe by two different columns? This question is only about *assigning* into multiple columns. You can use the numpy `logical_and` to combine two boolean-valued Pandas Series like `np.logical_and(df['ID'] == 103, df['name'] == 'Matthew')`. You can use the result of that as a row filter like any other conditional. – ely Jul 26 '18 at 12:05
  • what if instead of `== 103` the value is `>103` and I want to set the value to `Matt + Id`, how would I get the id column value which is `>103` – devssh Aug 09 '18 at 09:36
  • Hey is there a way to add iterable? I keep on getting `ValueError: Must have equal len keys and value when setting with an iterable` – haneulkim Nov 24 '20 at 04:07
  • For those that don't use the `.loc` syntax much and were a little confused how it worked here, these two short docs helped me get it better: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html, https://towardsdatascience.com/a-python-beginners-look-at-loc-part-1-cb1e1e565ec2 – lampShadesDrifter Jan 21 '21 at 04:57
  • +1 I got the warning `A value is trying to be set on a copy of a slice from a DataFrame.` when doing this in pandas v.1.3.4, but it works anyway – mikey Dec 08 '21 at 20:03
  • To add to @lampShadesDrifter comment, read the [second part of the TowardsDataScience article](https://towardsdatascience.com/a-python-beginners-look-at-loc-part-2-bddef7dfa7f2) to read about _setting_ values – Viet Than Aug 03 '22 at 16:27
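Regarding the > 103 question in the comments: the right-hand side of a .loc assignment can itself be a Series computed from the filtered rows. A minimal sketch, not part of the original answer:

mask = df['ID'] > 103
df.loc[mask, 'FirstName'] = 'Matt' + df.loc[mask, 'ID'].astype(str)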

You can use map; it can map values from a dictionary or even a custom function.

Suppose this is your df:

    ID First_Name Last_Name
0  103          a         b
1  104          c         d

Create the dicts:

fnames = {103: "Matt", 104: "Mr"}
lnames = {103: "Jones", 104: "X"}

And map:

df['First_Name'] = df['ID'].map(fnames)
df['Last_Name'] = df['ID'].map(lnames)

The result will be:

    ID First_Name Last_Name
0  103       Matt     Jones
1  104         Mr         X

Or use a custom function:

names = {103: ("Matt", "Jones"), 104: ("Mr", "X")}
df['First_Name'] = df['ID'].map(lambda x: names[x][0])
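
And, following the same pattern, the last names:

df['Last_Name'] = df['ID'].map(lambda x: names[x][1])
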
Rutger Kassies
  • Won't this generate a KeyError if the values do not exist in your dict? – EdChum Oct 07 '13 at 14:04
  • The custom function will; the others will work anyway. But I assumed the `dict` is created for the mapping. Otherwise some checking/cleaning can be done based on something like: `df.ID.isin(names.keys())` – Rutger Kassies Oct 07 '13 at 14:12
  • The custom function can be expanded into any (non anonymous) function. – user989762 Feb 12 '20 at 10:08
  • When I try this on my problem, I get AttributeError: 'DataFrame' object has no attribute 'map' – Liz Feb 07 '22 at 15:36
  • @Liz, that probably means you didn't select a single column (which makes it a Series). To do it on a DataFrame, you might consider using `.apply` combined with a `.map` on the resulting series. – Rutger Kassies Feb 08 '22 at 08:39
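
As a sketch of the isin check mentioned in the comments above (reusing the fnames dict from the answer), this maps only the IDs that are present in the dict and leaves all other rows untouched:

mask = df['ID'].isin(fnames.keys())
df.loc[mask, 'First_Name'] = df.loc[mask, 'ID'].map(fnames)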

The original question addresses a specific, narrow use case. For those who need more generic answers, here are some examples:

Creating a new column using data from other columns

Given the dataframe below:

import pandas as pd
import numpy as np

df = pd.DataFrame([['dog', 'hound', 5],
                   ['cat', 'ragdoll', 1]],
                  columns=['animal', 'type', 'age'])

In [1]: df
Out[1]:
  animal     type  age
----------------------
0    dog    hound    5
1    cat  ragdoll    1

Below we are adding a new description column as a concatenation of other columns by using the + operator, which is overridden for Series. Fancy string formatting, f-strings, etc. won't work here, since they operate on individual scalar values, whereas + is applied element-wise across whole Series:

df['description'] = 'A ' + df.age.astype(str) + ' years old ' \
                    + df.type + ' ' + df.animal

In [2]: df
Out[2]:
  animal     type  age                description
-------------------------------------------------
0    dog    hound    5    A 5 years old hound dog
1    cat  ragdoll    1  A 1 years old ragdoll cat

We get 1 years for the cat (instead of 1 year), which we will fix below using conditionals.

Modifying an existing column with conditionals

Here we are replacing the original animal column with values from other columns, and using np.where to set a conditional substring based on the value of age:

# append 's' to 'year' when age is greater than 1
df.animal = df.animal + ", " + df.type + ", " + \
    df.age.astype(str) + " year" + np.where(df.age > 1, 's', '')

In [3]: df
Out[3]:
                 animal     type  age
-------------------------------------
0   dog, hound, 5 years    hound    5
1  cat, ragdoll, 1 year  ragdoll    1

Modifying multiple columns with conditionals

A more flexible approach is to call .apply() on an entire dataframe rather than on a single column:

def transform_row(r):
    r.animal = 'wild ' + r.type
    r.type = r.animal + ' creature'
    r.age = "{} year{}".format(r.age, r.age > 1 and 's' or '')
    return r

df = df.apply(transform_row, axis=1)

In [4]: df
Out[4]:
         animal            type      age
----------------------------------------
0    wild hound    dog creature  5 years
1  wild ragdoll    cat creature   1 year

In the code above the transform_row(r) function takes a Series object representing a given row (indicated by axis=1; with the default axis=0 the function would instead receive a Series for each column). This simplifies processing since you can access the actual 'primitive' values in the row using the column names and have visibility of other cells in the given row/column.

ccpizza
  • Thanks for taking the time to write up such a comprehensive answer. Much appreciated. – Parseltongue Jun 22 '18 at 01:42
  • Thanks for this extremely helpful answer. One follow-up - what if we want to modify a column by doing math on the column, rather than modifying a string? For instance, using the example above, what if we want to multiply the df.age column by 7 if df.animal=='dog'? Thank you! – BGG16 Jul 27 '20 at 16:51
  • @GbG: `np.where` is probably what you are looking for, see e.g. https://stackoverflow.com/a/42540310/191246 but it's also possible that you won't be able to fit the logic into a scalar operation, then you'd need to explicitly transform the cell numerically similar to how it's done in `transform_row` – ccpizza Jul 27 '20 at 16:55
  • Thank you @ccpizza! Just what I was looking for. – BGG16 Jul 28 '20 at 15:43
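
To make that exchange concrete, here is a minimal sketch of the numeric case raised in the comments, not part of the original answer (assuming fresh animal and age columns as in the original dataframe): multiply age by 7 only where animal is 'dog':

df['age'] = np.where(df['animal'] == 'dog', df['age'] * 7, df['age'])
# or, equivalently, with .loc:
# df.loc[df['animal'] == 'dog', 'age'] *= 7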

This question might still be visited often enough that it's worth offering an addendum to Mr Kassies' answer. The dict built-in class can be sub-classed so that a default is returned for 'missing' keys. This mechanism works well for pandas. But see below.

In this way it's possible to avoid key errors.

>>> import pandas as pd
>>> data = { 'ID': [ 101, 201, 301, 401 ] }
>>> df = pd.DataFrame(data)
>>> class SurnameMap(dict):
...     def __missing__(self, key):
...         return ''
...     
>>> surnamemap = SurnameMap()
>>> surnamemap[101] = 'Mohanty'
>>> surnamemap[301] = 'Drake'
>>> df['Surname'] = df['ID'].apply(lambda x: surnamemap[x])
>>> df
    ID  Surname
0  101  Mohanty
1  201         
2  301    Drake
3  401         

The same thing can be done more simply in the following way. The use of the 'default' argument for the get method of a dict object makes it unnecessary to subclass a dict.

>>> import pandas as pd
>>> data = { 'ID': [ 101, 201, 301, 401 ] }
>>> df = pd.DataFrame(data)
>>> surnamemap = {}
>>> surnamemap[101] = 'Mohanty'
>>> surnamemap[301] = 'Drake'
>>> df['Surname'] = df['ID'].apply(lambda x: surnamemap.get(x, ''))
>>> df
    ID  Surname
0  101  Mohanty
1  201         
2  301    Drake
3  401         
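
A third option, avoiding the subclass entirely, is collections.defaultdict. A minimal sketch reusing the df above; note that the factory passed to defaultdict takes no arguments, and str() returns the empty string:

>>> import collections
>>> surnamemap = collections.defaultdict(str, {101: 'Mohanty', 301: 'Drake'})
>>> df['Surname'] = df['ID'].apply(lambda x: surnamemap[x])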
Bill Bell
  • this is by far the best and easiest answer I've seen, with excellent default handling. Thank you. – Brendan Mar 12 '18 at 22:58
  • @Brendan: Oh! Thanks very much. – Bill Bell Mar 13 '18 at 02:59
  • For the people going with this solution, be wary of some [pitfalls and possible pains](https://treyhunner.com/2019/04/why-you-shouldnt-inherit-from-list-and-dict-in-python/) you're exposing yourself to when sub-classing from `dict` directly. – deepbrook Nov 18 '20 at 11:15
  • Probably `collections.defaultdict` could be used instead of subclassing the builtin `dict` class. Something like `d = {...}; dd = collections.defaultdict(lambda: '...', d)` and you can use `dd`. – Philippe Carphin Nov 21 '22 at 17:06
df['FirstName'] = df['ID'].apply(lambda x: 'Matt' if x == 103 else '')
df['LastName'] = df['ID'].apply(lambda x: 'Jones' if x == 103 else '')
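
Note that this approach also overwrites FirstName and LastName with the empty string for every row whose ID is not 103. A sketch that instead leaves non-matching rows untouched, using Series.where:

# keep the existing value where ID != 103; substitute 'Matt' where it is 103
df['FirstName'] = df['FirstName'].where(df['ID'] != 103, 'Matt')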
mpriya
  • The community encourages adding explanations to questions and not posting purely code answers (see [here](https://meta.stackoverflow.com/questions/300837/what-comment-should-i-add-to-code-only-answers)). – costaparas Jan 23 '21 at 03:45
  • Yet, in my opinion, this is the best, clearest, cleanest and most concise answer. – Dan Nissenbaum Mar 22 '21 at 05:42

In case someone is looking for a way to change the values of multiple rows based on some logical condition of each row itself, using .apply() with a function is the way to go.

df = pd.DataFrame({'col_a':[0,0], 'col_b':[1,2]})

   col_a  col_b
0      0      1
1      0      2

def func(row):
    if row.col_a == 0 and row.col_b <= 1:
        row.col_a = -1
        row.col_b = -1
    return row

df.apply(func, axis=1)

   col_a  col_b
0     -1     -1 # Modified row
1      0      2

Although .apply() is typically used to add a new row/column to a dataframe, it can be used to modify the values of existing rows/columns. Note that it returns a new DataFrame rather than modifying df in place, so assign the result back (df = df.apply(func, axis=1)) if you want to keep the changes.

Gabriel Cia

I found it much easier to debug by printing out where each row meets the condition:

for n in df.columns:
    # .any() is True when at least one value in the column equals 103
    if (df[n] == 103).any():
        print(n)
        print(df[df[n] == 103].index)
Ruli