2

I have dataframe shown in below: column name 'Types'shows each types dified

I would like to add another column named 'number' defined as below.

df=pd.DataFrame({'Sex':['M','F','F','M'],'Age':[30,31,33,32],'Types':['A','C','B','D']})

Out[8]: 

    Age Sex  Types
0   30   M      A
1   31   F      C
2   33   F      B
3   32   M      D

and I have another male table below; each column represents Types!

(It was difficult to create table for me, Are there another easy way to create?)

table_M = pd.DataFrame(np.arange(20).reshape(4,5),index=[30,31,32,33],columns=["A","B","C","D","E"])
table_M.index.name="Age(male)"

         A      B      C      D      E
Age(male)                                   
30       0      1      2      3      4
31       5      6      7      8      9
32      10     11     12     13     14
33      15     16     17     18     19

and I have female table below;

table_F = pd.DataFrame(np.arange(20,40).reshape(4,5),index=[30,31,32,33],columns=["A","B","C","D","E"])
table_F.index.name="Age(female)"

        A      B      C      D      E
Age(female)                                   
30      20     21     22     23     24
31      25     26     27     28     29
32      30     31     32     33     34
33      35     36     37     38     39

so I would like to add 'number' column as shown below;

    Age Sex  Types   number
0   30   M      A      0 
1   31   F      C     27
2   33   F      B     36
3   32   M      D     13

this number column refer to female and male table. for each age , Type, and Sex. It was too complicated for me. Can I ask how to add 'number' column?

Heisenberg
  • 4,787
  • 9
  • 47
  • 76

3 Answers3

5

I suggest reshaping your male and female tables:

males = (table_M.stack().to_frame('number').assign(Sex='M').reset_index()
                .rename(columns={'Age(male)': 'Age', 'level_1': 'Types'}))

females = (table_F.stack().to_frame('number').assign(Sex='F').reset_index()
                  .rename(columns={'Age(female)': 'Age', 'level_1': 'Types'}))

reshaped = pd.concat([males, females], ignore_index=True)

Then merge:

df.merge(reshaped)
Out: 
   Age Sex Types  number
0   30   M     A       0
1   31   F     C      27
2   33   F     B      36
3   32   M     D      13

What this does is it stacks the columns of Male and Female tables, and assigns an indicator column showing Sex ('M' and 'F'). females.head() looks like this:

females.head()
Out: 
   Age Types  number Sex
0   30     A      20   F
1   30     B      21   F
2   30     C      22   F
3   30     D      23   F
4   30     E      24   F

and males.head():

males.head()
Out: 
   Age Types  number Sex
0   30     A       0   M
1   30     B       1   M
2   30     C       2   M
3   30     D       3   M
4   30     E       4   M

With pd.concat these two are combined into a single DataFrame and merge by default works on the common columns so it looks for the matches in 'Age', 'Sex', 'Types' columns and merge two DataFrames based on that.


One other possibility is to use df.lookup:

df.loc[df['Sex']=='M', 'number'] = table_M.lookup(*df.loc[df['Sex']=='M', ['Age', 'Types']].values.T)
df.loc[df['Sex']=='F', 'number'] = table_F.lookup(*df.loc[df['Sex']=='F', ['Age', 'Types']].values.T)

df
Out: 
   Age Sex Types  number
0   30   M     A     0.0
1   31   F     C    27.0
2   33   F     B    36.0
3   32   M     D    13.0

This looks up the males in table_M, and females in table_F.

ayhan
  • 70,170
  • 20
  • 182
  • 203
  • Creating new data frames is not necessary. See my answer. – Nehal J Wani Aug 28 '16 at 12:53
  • @NehalJWani apply with `axis=1` is highly inefficient and you should avoid it whenever you have a vectorized alternative. With only four rows that might be OK but if you have millions of records that would take forever to run. Other than that, tidying the dataset is never unnecessary in my opinion. It makes the subsequent steps much more easier in your analysis. – ayhan Aug 28 '16 at 13:12
  • With 10 thousand rows apply took 41.1 seconds while stack/concat/merge finished in 482 ms and most of it is overhead (if you increase it to 1 million rows it takes 672 ms instead). – ayhan Aug 28 '16 at 13:15
  • 2
    Ah, I see. Thanks for the info :) Something new to learn everyday! – Nehal J Wani Aug 28 '16 at 13:16
  • Thankyou for early reply! very simple and applicable answer! I have a one question.why you use asterisk * inside the lookup fuc? – Heisenberg Aug 28 '16 at 14:51
  • 1
    @MiyashitaHikaru lookup has two arguments: row_labels and col_labels. `df.loc[df['Sex']=='M', ['Age', 'Types']].values.T` on the other hand is a single numpy array containing two other numpy arrays. `*` operator unpacks that main array so the first array is row_labels and the second array is col_labels. Here's [a more detailed explanation](http://stackoverflow.com/questions/5239856/foggy-on-asterisk-in-python). – ayhan Aug 28 '16 at 15:25
4

It's easier if you two tables are combined such that you can access the 'Sex' via an apply.

table = pd.concat([table_F, table_M], axis=1, keys=['F', 'M'])

accessor = lambda row: table.loc[row.Age, (row.Sex, row.Types)]
df['number'] = df.apply(accessor, axis=1)
df

enter image description here

piRSquared
  • 285,575
  • 57
  • 475
  • 624
1

Another way to do this:

In [60]: df['numbers'] = df.apply(lambda x: table_F.loc[[x.Age]][x.Types].iloc[0] if x.Sex == 'F' else table_M.loc[[x.Age]][x.Types].iloc[0], axis = 1)

In [60]: df
Out[60]: 
   Age Sex Types  numbers
0   30   M     A        0
1   31   F     C       27
2   33   F     B       36
3   32   M     D       13
Nehal J Wani
  • 16,071
  • 3
  • 64
  • 89