
I have a column "Country" in a data frame, and I would like to group the "Country" column into only two categories: "Mainland China" and "Others". I have tried different options, e.g. filter, etc., but none of them works. How should I do it?

Here is the dataset https://drive.google.com/file/d/17DY8f-Jxba0Ky5iOUQqEZehhoWNO3vzR/view?usp=sharing

FYI, I have already grouped the different provinces in China into one country, "Mainland China".

Thanks for your help!

[screenshot of the data frame]

almo
  • can you post the data as code please, not a picture – gold_cy Feb 09 '20 at 21:03
  • one min, I will do it – almo Feb 09 '20 at 21:04
  • interesting dataset – Arco Bast Feb 09 '20 at 21:43
  • Do not share information as images unless absolutely necessary. See: https://meta.stackoverflow.com/questions/303812/discourage-screenshots-of-code-and-or-errors, [mcve]. This seems rather basic, have you read the Pandas docs? – AMC Feb 10 '20 at 01:44
  • Oh, and this is essentially a duplicate of https://stackoverflow.com/questions/19913659/pandas-conditional-creation-of-a-series-dataframe-column. – AMC Feb 10 '20 at 02:07

4 Answers

3

I think the quickest way to change the values would be to use .loc instead of apply, since .loc is an optimized, vectorized pandas operation.

df.loc[df.Country != 'Mainland China', 'Country'] = 'Others'
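For completeness, here is a minimal sketch of the full recode-then-group step on a toy frame (the Confirmed column and the sample values are hypothetical, standing in for whatever numeric columns your dataset actually has):

import pandas as pd

# toy data standing in for the real dataset (column names are assumptions)
df = pd.DataFrame({
    "Country": ["Mainland China", "Italy", "Mainland China", "Japan"],
    "Confirmed": [100, 5, 200, 3],
})

# collapse everything that is not Mainland China into a single bucket
df.loc[df.Country != "Mainland China", "Country"] = "Others"

# then aggregate per bucket
print(df.groupby("Country")["Confirmed"].sum())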
gold_cy
1

Try (and group by Country):

import numpy as np

df["Country"]=np.where(df["Country"].eq("Mainland China"), "Mainland China", "Other")

Edit

timeit results (note that I didn't include .loc[], since an assignment statement cannot go inside a lambda; feel free to suggest a way of adding it):

import pandas as pd
import numpy as np
from timeit import Timer

# proportion-wise, this matches the dataframe from the OP's question
df = pd.DataFrame({"Country": ["Mainland China"] * 398 + ["a", "b", "c"] * 124})

df["otherCol"] = 2
df["otherCol2"] = 3

# shuffle
df2 = df.copy().sample(frac=1)
df3 = df2.copy()
df4 = df3.copy()

op2 = Timer(lambda: np.where(df2["Country"].eq("Mainland China"), "Mainland China", "Others"))
op3 = Timer(lambda: df3.Country.map(lambda x: x if x == "Mainland China" else "Others"))
op4 = Timer(lambda: df4["Country"].apply(lambda x: x if x == "Mainland China" else "Others"))

print(op2.timeit(number=1000))
print(op3.timeit(number=1000))
print(op4.timeit(number=1000))

Returns:

2.1856687490362674 #numpy
2.2388894270407036 #map
2.4437739049317315 #apply
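If you do want to time `.loc[]` as well (skipped above because an assignment statement cannot live inside a lambda), one possible workaround is to wrap the assignment in a named function and subtract the cost of the copy, roughly as suggested in the comments. This is only a sketch continuing from the snippet above, not part of the original benchmark:

df5 = df.copy().sample(frac=1)

def just_copy():
    df5.copy()

def copy_and_assign():
    d = df5.copy()  # copy so every run starts from unmodified data
    d.loc[d.Country != "Mainland China", "Country"] = "Others"

t_copy = Timer(just_copy).timeit(number=1000)
t_both = Timer(copy_and_assign).timeit(number=1000)
print(t_both - t_copy)  # rough cost of the .loc assignment alone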
Grzegorz Skibinski
  • Thanks Grzegorz, even though your solution came later than the other two, now I know one extra way of solving this problem. Thank you :D – almo Feb 09 '20 at 21:16
  • 2
    No worries - time them and you will have some criteria to compare ;) I would expect `np.where` to be a bit faster than `.loc[...]`; `.apply(...)` is out of the competition here. – Grzegorz Skibinski Feb 09 '20 at 21:21
  • What is the advantage of using this over `.loc[]`, aside from a tiny performance gain? – AMC Feb 10 '20 at 01:45
  • Looking at https://stackoverflow.com/a/31173785/5082048, performance might be lower than for `.map(lambda x: ...)` for small datasets. List comprehensions scored best in that benchmark. – Arco Bast Feb 10 '20 at 23:04
  • 2
    I benchmarked all the methods except `.loc[]` - please see above. – Grzegorz Skibinski Feb 11 '20 at 07:36
  • @GrzegorzSkibinski I just see it. Great to know. Thanks! – almo Feb 11 '20 at 16:20
  • Awesome, take your well-deserved upvote! For benchmarking loc: what about defining two functions? The first just copies the dataframe; the second copies it, then assigns the column. The time of executing the `loc` statement would be the difference between the two. Also, out of interest, do you also want to add the list comprehension method? – Arco Bast Feb 11 '20 at 19:42
-1

Try using apply:

dataframe["Country"] = dataframe["Country"].apply(lambda x: x if x == "Mainland China" else "Others")
Oliver Ni
  • Thanks, your solution works perfectly as well. I have already accepted another solution, but I sincerely appreciate your help! – almo Feb 09 '20 at 21:14
  • @AMC Thanks for getting involved in the discussion! I suppose we should respect everybody's efforts. What do you think? :) – almo Feb 10 '20 at 07:07
  • @almo I agree entirely, my statement was in no way related to the answerer’s person or character. – AMC Feb 10 '20 at 07:10
  • 1
    @AMC on the plus side, it is quite flexible if other categories need to be defined in the future. – Arco Bast Feb 10 '20 at 16:51
-2

Assuming df is your pandas DataFrame.

You could do:

df['Country'] = df.Country.map(lambda x: x if x == 'Mainland China' else 'Others')
Arco Bast
  • 1
    Thanks, perfect :D I have wasted nearly one hour on this. – almo Feb 09 '20 at 21:11
  • @AMC on the plus side, it is quite flexible if other categories need to be defined in the future. – Arco Bast Feb 10 '20 at 16:49
  • @ArcoBast You could just use `.map()` and a dictionary, which is likely the most flexible solution (see the sketch after this thread). – AMC Feb 10 '20 at 21:33
  • @AMC I thought about this, but mapping everything except 'Mainland China' to a single value is not straightforward with a dictionary. I could have suggested using a defaultdict, of course, but considered that to be overkill. From an analyst's point of view, when working with a small dataset like this one, flexibility beats speed in my experience. – Arco Bast Feb 10 '20 at 22:31
  • @ArcoBast It's certainly a unique situation, yes. For this particular case, I like the solution using `.loc[]`. As soon as the number of values to map changes, I prefer `.map()` with a dict. – AMC Feb 10 '20 at 22:52
  • I think the fact that the changes are being made on the same column does have an effect. – AMC Feb 10 '20 at 22:53
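Following up on the `.map()` plus dictionary idea from the comments above: one way to get the catch-all bucket without a defaultdict is to rely on the fact that values missing from the dict come back as NaN, which `.fillna()` can then replace. A minimal sketch, assuming `df` holds the question's data:

# extend the dict with more explicit buckets if they are ever needed
mapping = {"Mainland China": "Mainland China"}

# anything not in the dict maps to NaN and then falls into the catch-all bucket
df["Country"] = df["Country"].map(mapping).fillna("Others")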