3

I'm just getting into pandas and I am trying to add a new column to an existing dataframe.

I have two dataframes where the index of one data frame links to a column in another dataframe. Where these values are equal I need to put the value of another column in the source dataframe in a new column of the destination column.

The code section below illustrates what I mean. The commented part is what I need as an output.

I guess I need the .loc[] function.

Another, minor, question: is it bad practice to have a non-unique indexes?

import pandas as pd

d = {'key':['a',  'b', 'c'], 
     'bar':[1, 2, 3]}

d2 = {'key':['a', 'a', 'b'],
      'other_data':['10', '20', '30']}

df = pd.DataFrame(d)
df2 = pd.DataFrame(data = d2)
df2 = df2.set_index('key')

print df2

##    other_data  new_col
##key           
##a            10   1
##a            20   1
##b            30   2
Billal Begueradj
  • 20,717
  • 43
  • 112
  • 130
ArnJac
  • 352
  • 2
  • 5
  • 15

5 Answers5

9

Use rename index by Series:

df2['new'] = df2.rename(index=df.set_index('key')['bar']).index
print (df2)

    other_data  new
key                
a           10    1
a           20    1
b           30    2

Or map:

df2['new'] = df2.index.to_series().map(df.set_index('key')['bar'])
print (df2)

    other_data  new
key                
a           10    1
a           20    1
b           30    2

If want better performance, the best is avoid duplicates in index. Also some function like reindex failed in duplicates index.

jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
3

You can use join

df2.join(df.set_index('key'))

    other_data  bar
key                
a           10    1
a           20    1
b           30    2

One way to rename the column in the process

df2.join(df.set_index('key').bar.rename('new'))

    other_data  new
key                
a           10    1
a           20    1
b           30    2
piRSquared
  • 285,575
  • 57
  • 475
  • 624
2

With the help of .loc

df2['new'] = df.set_index('key').loc[df2.index]

Output :

   other_data  new
key                
a           10    1
a           20    1
b           30    2
Bharath M Shetty
  • 30,075
  • 6
  • 57
  • 108
2

Another, minor, question: is it bad practice to have a non-unique indexes?

It is not great practice, but depends on your needs and can be okay in some circumstances.

Issue 1: join operations

A good place to start is to think about what makes an Index different from a standard DataFrame column. This engenders the question: if your Index has duplicate values, does it really need to be specified as an Index, or could it just be another column in a RangeIndex-ed DataFrame? If you've ever used SQL or any other DMBS and want to mimic join operations in pandas with functions such as .join or .merge, you'll lose the functionality of a primary key if you have duplicate index values. A merge will give you what is basically a cartesian product--probably not what you're looking for.

For example:

df = pd.DataFrame(np.random.randn(10,2),
                  index=2*list('abcde'))
df2 = df.rename(columns={0: 'a', 1 : 'b'})
print(df.merge(df2, left_index=True, right_index=True).head(7))
         0        1        a        b
a  0.73737  1.49073  0.73737  1.49073
a  0.73737  1.49073 -0.25562 -2.79859
a -0.25562 -2.79859  0.73737  1.49073
a -0.25562 -2.79859 -0.25562 -2.79859
b -0.93583  1.17583 -0.93583  1.17583
b -0.93583  1.17583 -1.77153 -0.69988
b -1.77153 -0.69988 -0.93583  1.17583

Issue 2: performance

Unique-valued indices make certain operations efficient, as explained in this post.

When index is unique, pandas use a hashtable to map key to value O(1). When index is non-unique and sorted, pandas use binary search O(logN), when index is random ordered pandas need to check all the keys in the index O(N).

A word on .loc

Using .loc will return all instances of the label. This can be a blessing or a curse depending on what your objective is. For example,

df = pd.DataFrame(np.random.randn(10,2),
                  index=2*list('abcde'))
print(df.loc['a'])
         0        1
a  0.73737  1.49073
a -0.25562 -2.79859
Brad Solomon
  • 38,521
  • 31
  • 149
  • 235
1

Using combine_first

In [442]: df2.combine_first(df.set_index('key')).dropna()
Out[442]:
     bar other_data
key
a    1.0         10
a    1.0         20
b    2.0         30

Or, using map

In [461]: df2.assign(bar=df2.index.to_series().map(df.set_index('key')['bar']))
Out[461]:
    other_data  bar
key
a           10    1
a           20    1
b           30    2
Zero
  • 74,117
  • 18
  • 147
  • 154