
I have a column "Country" in a data frame, and I would like to group the "Country" column into only two categories: "Mainland China" and "Others". I have tried different options, e.g. filter, etc., but none of them works. How should I do it?

Here is the dataset https://drive.google.com/file/d/17DY8f-Jxba0Ky5iOUQqEZehhoWNO3vzR/view?usp=sharing

FYI, I have already grouped the different provinces in China into one country, "Mainland China".

Thanks for your help!

[screenshot of the data frame]

almo
  • can you post the data as code please, not a picture – gold_cy Feb 09 '20 at 21:03
  • one min, I will do it – almo Feb 09 '20 at 21:04
  • interesting dataset – Arco Bast Feb 09 '20 at 21:43
  • Do not share information as images unless absolutely necessary. See: https://meta.stackoverflow.com/questions/303812/discourage-screenshots-of-code-and-or-errors, [mcve]. This seems rather basic, have you read the Pandas docs? – AMC Feb 10 '20 at 01:44
  • Oh, and this is essentially a duplicate of https://stackoverflow.com/questions/19913659/pandas-conditional-creation-of-a-series-dataframe-column. – AMC Feb 10 '20 at 02:07

4 Answers

3

I think the quickest way to change the values would be to use .loc instead of apply, since .loc is an optimized, vectorized pandas operation.

df.loc[df.Country != 'Mainland China', 'Country'] = 'Others'
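For completeness, here is a minimal sketch of the full recode-then-group step on a toy frame (the Confirmed column and the sample values are hypothetical, standing in for whatever numeric columns your dataset actually has):

import pandas as pd

# toy data standing in for the real dataset (column names are assumptions)
df = pd.DataFrame({
    "Country": ["Mainland China", "Italy", "Mainland China", "Japan"],
    "Confirmed": [100, 5, 200, 3],
})

# collapse everything that is not Mainland China into a single bucket
df.loc[df.Country != "Mainland China", "Country"] = "Others"

# then aggregate per bucket
print(df.groupby("Country")["Confirmed"].sum())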
gold_cy
1

Try (and group by Country):

import numpy as np

df["Country"]=np.where(df["Country"].eq("Mainland China"), "Mainland China", "Other")

Edit

timeit results (note that I didn't include .loc[], since an assignment statement cannot go inside a lambda; feel free to suggest a way of adding it):

import pandas as pd
import numpy as np
from timeit import Timer

# proportion-wise, this matches the dataframe from the OP's question
df = pd.DataFrame({"Country": ["Mainland China"] * 398 + ["a", "b", "c"] * 124})

df["otherCol"] = 2
df["otherCol2"] = 3

# shuffle
df2 = df.copy().sample(frac=1)
df3 = df2.copy()
df4 = df3.copy()

op2 = Timer(lambda: np.where(df2["Country"].eq("Mainland China"), "Mainland China", "Others"))
op3 = Timer(lambda: df3.Country.map(lambda x: x if x == "Mainland China" else "Others"))
op4 = Timer(lambda: df4["Country"].apply(lambda x: x if x == "Mainland China" else "Others"))

print(op2.timeit(number=1000))
print(op3.timeit(number=1000))
print(op4.timeit(number=1000))

Returns:

2.1856687490362674 #numpy
2.2388894270407036 #map
2.4437739049317315 #apply
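If you do want to time `.loc[]` as well (skipped above because an assignment statement cannot live inside a lambda), one possible workaround is to wrap the assignment in a named function and subtract the cost of the copy, roughly as suggested in the comments. This is only a sketch continuing from the snippet above, not part of the original benchmark:

df5 = df.copy().sample(frac=1)

def just_copy():
    df5.copy()

def copy_and_assign():
    d = df5.copy()  # copy so every run starts from unmodified data
    d.loc[d.Country != "Mainland China", "Country"] = "Others"

t_copy = Timer(just_copy).timeit(number=1000)
t_both = Timer(copy_and_assign).timeit(number=1000)
print(t_both - t_copy)  # rough cost of the .loc assignment alone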
Grzegorz Skibinski
  • Thanks Grzegorz, even though your solution came later than the other two, now I know one extra way of solving this problem. Thank you :D – almo Feb 09 '20 at 21:16
  • 2
    No worries - time them and you will have some criteria to compare ;) I would expect `np.where` to be a bit faster than `.loc[...]`; `.apply(...)` is out of the competition here. – Grzegorz Skibinski Feb 09 '20 at 21:21
  • What is the advantage of using this over `.loc[]`, aside from a tiny performance gain? – AMC Feb 10 '20 at 01:45
  • Looking at https://stackoverflow.com/a/31173785/5082048, performance might be lower than for `.map(lambda x: ...)` for small datasets. List comprehensions scored best in that benchmark. – Arco Bast Feb 10 '20 at 23:04
  • 2
    I benchmarked all the methods except `.loc[]` - please see above. – Grzegorz Skibinski Feb 11 '20 at 07:36
  • @GrzegorzSkibinski I just see it. Great to know. Thanks! – almo Feb 11 '20 at 16:20
  • Awesome, take your well-deserved upvote! For benchmarking loc: what about defining two functions? The first just copies the dataframe; the second copies it, then assigns the column. The time of executing the `loc` statement would be the difference between the two. Also, out of interest, do you also want to add the list comprehension method? – Arco Bast Feb 11 '20 at 19:42
-1

Try using apply:

dataframe["Country"] = dataframe["Country"].apply(lambda x: x if x == "Mainland China" else "Others")
Oliver Ni
  • Thanks, your solution works perfectly as well. I have already accepted another solution, but I sincerely appreciate your help! – almo Feb 09 '20 at 21:14
  • @AMC Thanks for getting involved in the discussion! I suppose we should respect everybody's efforts. What do you think? :) – almo Feb 10 '20 at 07:07
  • @almo I agree entirely, my statement was in no way related to the answerer’s person or character. – AMC Feb 10 '20 at 07:10
  • 1
    @AMC on the plus side, it is quite flexible if other categories need to be defined in the future. – Arco Bast Feb 10 '20 at 16:51
-2

Assuming df is your pandas DataFrame.

You could do:

df['Country'] = df.Country.map(lambda x: x if x == 'Mainland China' else 'Others')
Arco Bast
  • 1
    Thanks, perfect :D I have wasted nearly one hour on this. – almo Feb 09 '20 at 21:11
  • @AMC on the plus side, it is quite flexible if other categories need to be defined in the future. – Arco Bast Feb 10 '20 at 16:49
  • @ArcoBast You could just use `.map()` and a dictionary, which is likely the most flexible solution (see the sketch after this thread). – AMC Feb 10 '20 at 21:33
  • @AMC I thought about this, but mapping everything except 'Mainland China' to a single value is not straightforward with a dictionary. I could have suggested using a defaultdict, of course, but considered that to be overkill. From an analyst's point of view, when working with a small dataset like this one, flexibility beats speed in my experience. – Arco Bast Feb 10 '20 at 22:31
  • @ArcoBast It's certainly a unique situation, yes. For this particular case, I like the solution using `.loc[]`. As soon as the number of values to map changes, I prefer `.map()` with a dict. – AMC Feb 10 '20 at 22:52
  • I think the fact that the changes are being made on the same column does have an effect. – AMC Feb 10 '20 at 22:53
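Following up on the `.map()` plus dictionary idea from the comments above: one way to get the catch-all bucket without a defaultdict is to rely on the fact that values missing from the dict come back as NaN, which `.fillna()` can then replace. A minimal sketch, assuming `df` holds the question's data:

# extend the dict with more explicit buckets if they are ever needed
mapping = {"Mainland China": "Mainland China"}

# anything not in the dict maps to NaN and then falls into the catch-all bucket
df["Country"] = df["Country"].map(mapping).fillna("Others")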