2

I have a Dataframe called df_base that looks like this. As you can see, there's a column called Sex that's male or female. I want to map these values to 0 and 1, respectively.

+---+-------------+----------+--------+---------------------------------------------------+--------+-----+-------+-------+------------------+---------+-------+----------+
|   | PassengerId | Survived | Pclass |                       Name                        |  Sex   | Age | SibSp | Parch |      Ticket      |  Fare   | Cabin | Embarked |
+---+-------------+----------+--------+---------------------------------------------------+--------+-----+-------+-------+------------------+---------+-------+----------+
| 0 |           1 |        0 |      3 | Braund, Mr. Owen Harris                           | male   |  22 |     1 |     0 | A/5 21171        |    7.25 | NaN   | S        |
| 1 |           2 |        1 |      1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female |  38 |     1 |     0 | PC 17599         | 71.2833 | C85   | C        |
| 2 |           3 |        1 |      3 | Heikkinen, Miss. Laina                            | female |  26 |     0 |     0 | STON/O2. 3101282 |   7.925 | NaN   | S        |
| 3 |           4 |        1 |      1 | Futrelle, Mrs. Jacques Heath (Lily May Peel)      | female |  35 |     1 |     0 | 113803           |    53.1 | C123  | S        |
| 4 |           5 |        0 |      3 | Allen, Mr. William Henry                          | male   |  35 |     0 |     0 | 373450           |    8.05 | NaN   | S        |
+---+-------------+----------+--------+---------------------------------------------------+--------+-----+-------+-------+------------------+---------+-------+----------+

There's a few methods that I've seen dotted about on StackOverflow but I'm wondering what the most efficient is to perform the following mapping:

+---------+---------+
| Old Sex | New Sex |
+---------+---------+
| male    |       0 |
| female  |       1 |
| female  |       1 |
| female  |       1 |
| male    |       0 |
+---------+---------+

I'm using this:

df_base['Sex'].replace(['male','female'],[0,1],inplace=True)

... but I can't help but feel as though this is a little shoddy. Is there a better way of doing this? There's also using .loc but that loops around the rows of the Dataframe, so is less efficient, right?

Aaraeus
  • 1,105
  • 3
  • 14
  • 26
  • 1
    Why do I not see people suggesting duplicates when the question is _actually_ a duplicate? – cs95 Jan 05 '19 at 14:01
  • @coldspeed, I never got to know about this thread So, couldn't broadcast it as dups. – Karn Kumar Jan 05 '19 at 14:13
  • 2
    @pygo I have nothing against people suggesting dupes, but I dislike double standards. I see it everywhere, and more for some users than others. – cs95 Jan 05 '19 at 14:15
  • @coldspeed - Yes, standard solution is replace/map, but if OP need performnce there are also another solutions, so no closed by dupe from me. – jezrael Jan 05 '19 at 14:31
  • 1
    @jezrael Every question can be left open with that reasoning. I am sure you have used the same reasoning to answer similar questions before, so find one of those and mark as dupe. Or add an answer to the dupe target. You will get much more rep that way, than answering questions like this over and over again. – cs95 Jan 05 '19 at 14:33
  • @coldspeed - yes, but here are only 2 values in `Sex` column, so it is only special case, so not possible add answer to dupe - it is good general solution for replace for many values... Because only 2 replacement values is possible faster solution. – jezrael Jan 05 '19 at 14:35

3 Answers3

7

I think here is better/faster use map by dictionary if only male and female exist in column Sex:

df_base['Sex'] = df_base['Sex'].map(dict(zip(['male','female'],[0,1]))

What is same like:

df_base['Sex'] = df_base['Sex'].map({'male': 0,'female': 1})

Solution if exist only female and male values is cast boolean mask to integers True/False to 1,0:

df_base['Sex'] = (df_base['Sex'] == 'female').astype(int)

Performance:

np.random.seed(2019)

import perfplot    

def ma(df):
    df = df.copy()
    df['Sex_new'] = df['Sex'].map({'male': 0,'female': 1})
    return df

def rep1(df):
    df = df.copy()
    df['Sex'] = df['Sex'].replace(['male','female'],[0,1])
    return df

def nwhere(df):
    df = df.copy()
    df['Sex_new'] = np.where(df['Sex'] == 'male', 0, 1)
    return df

def mask1(df):
    df = df.copy()
    df['Sex_new'] = (df['Sex'] == 'female').astype(int)
    return df

def mask2(df):
    df = df.copy()
    df['Sex_new'] = (df['Sex'].values == 'female').astype(int)
    return df


def make_df(n):
    df = pd.DataFrame({'Sex': np.random.choice(['male','female'], size=n)})

    return df

perfplot.show(
    setup=make_df,
    kernels=[ma,  rep1, nwhere, mask1, mask2],
    n_range=[2**k for k in range(2, 18)],
    logx=True,
    logy=True,
    equality_check=False,  # rows may appear in different order
    xlabel='len(df)')

pic

Conclusion:

If replace only 2 values is slowiest replace, numpy.where, map and mask are similar. For improve performance compare by numpy array with .values.
Also all depends of data, so best test with real data.

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
2

My instinct would have suggested to use .map(), but I made a comparison between your solution and map, based on a dataframe with 1500 random male/female values.

%timeit df_base['Sex_new'] = df_base['Sex'].map({'male': 0,'female': 1})
1000 loops, best of 3: 653 µs per loop

Edited Based on coldspeeds comment, and because reassigning it is a better comparison with the others:

%timeit df_base['Sex_new'] = df_base['Sex'].replace(['male','female'],[0,1])
1000 loops, best of 3: 968 µs per loop

So actually slower .map()...!

So based on this example, your 'shoddy' solution seems faster than .map()...

Edit

pygo's solution:

%timeit df_base['Sex_new'] = np.where(df_base['Sex'] == 'male', 0, 1)
1000 loops, best of 3: 331 µs per loop

So faster!

Jezrael's solution with .astype(int):

%timeit df_base['Sex_new'] = (df_base['Sex'] == 'female').astype(int)
1000 loops, best of 3: 388 µs per loop

So also faster than .map() and .replace().

Niels Henkens
  • 2,553
  • 1
  • 12
  • 27
  • What about `df['Old_Sex'] = np.where(df['Sex'] == 'male', 0, 1)` comparison? – Karn Kumar Jan 05 '19 at 14:05
  • And check [this](https://stackoverflow.com/questions/49259580/replace-values-in-a-pandas-series-via-dictionary-efficiently) for `map/replace` comparison – jezrael Jan 05 '19 at 14:06
  • Yes, Solution with numpy will be faster. – Karn Kumar Jan 05 '19 at 14:07
  • Or `df_base['Sex'] = (df_base['Sex'].values == 'female').astype(int)` – jezrael Jan 05 '19 at 14:07
  • 1
    Note that `df_base['Sex'].replace(['male','female'],[0,1],inplace=True)` will not work because `inplace=True` on Series with replace has a history of being problematic. Assign it back. – cs95 Jan 05 '19 at 14:15
  • 1
    Thank you for timings, but I get something else, `replace` is slowier for me, check my answer. What is your code for testing performance? – jezrael Jan 05 '19 at 14:32
  • @jezrael: It is for me as well if I reassign it. See the edit (I posted this already at the bottom, but made it more prominent now). I really like your plot of the different options! – Niels Henkens Jan 05 '19 at 14:40
  • @NielsHenkens - Yes, I like it too, I learn it from unutbu :) – jezrael Jan 05 '19 at 14:43
  • I popularised `perfplot` in [this](https://stackoverflow.com/a/54028200/4909087) answer I wrote recently – cs95 Jan 05 '19 at 14:44
  • @jezrael those perf plots are very nice , i'll try ti learn it :-) – Karn Kumar Jan 05 '19 at 14:46
  • @coldspeed, the post is really nice , However what method you are using to plot the graphs perfplot? – Karn Kumar Jan 05 '19 at 14:50
  • @pygo yes, scroll to the bottom of the post, I've added all code snippets there. – cs95 Jan 05 '19 at 14:51
1

Another solution you can use with np.where:

Just an example DataFrame:

>>> df
      Sex
0    male
1  female
2  female
3  female
4    male

Based on the condition create new column new_Sex

>>> df['new_Sex'] = np.where(df['Sex'] == 'male', 0, 1)
>>> df
      Sex  new_Sex
0    male        0
1  female        1
2  female        1
3  female        1
4    male        0

OR:

>>> df['new_Sex'] = np.where(df['Sex'] != 'male', 1, 0)
>>> df
      Sex  new_Sex
0    male        0
1  female        1
2  female        1
3  female        1
4    male        0
Karn Kumar
  • 8,518
  • 3
  • 27
  • 53