169

I have a dataframe with some columns like this:

A   B   C  
0   
4
5
6
7
7
6
5

The values in A can only range from 0 to 7.

Also, I have a list of 8 elements like this:

List = [2, 5, 6, 8, 12, 16, 26, 32]  # There are only 8 elements in this list

If the element in column A is n, I need to insert the nth element (counting from 0) of List into a new column, say 'D'.

How can I do this in one go without looping over the whole dataframe?

The resulting dataframe would look like this:

A   B   C   D
0           2
4           12
5           16
6           26
7           32
7           32
6           26
5           16

Note: The dataframe is huge, and iteration is the last option. I can also put the elements of 'List' into another data structure, such as a dict, if necessary.

Mark Rotteveel
mane

6 Answers

424

Just assign the list directly:

df['new_col'] = mylist

Alternative
Convert the list to a series or array and then assign:

se = pd.Series(mylist)
df['new_col'] = se.values

or

df['new_col'] = np.array(mylist)
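
Note that direct assignment places the list elements row by row, so the list needs exactly as many elements as the dataframe has rows. A minimal sketch, using made-up values that are not from the question, of what happens when the lengths match and when they don't:

```python
import pandas as pd

# Illustrative data only; the values are assumptions for this sketch.
df = pd.DataFrame({"A": [0, 4, 5]})
mylist = [10, 20, 30]  # same length as df

# Element i of the list ends up in row i of the new column.
df["new_col"] = mylist

# A list of the wrong length is rejected with a ValueError.
try:
    df["too_short"] = [1, 2]
except ValueError as exc:
    error_message = str(exc)
```

So this form fits when the list already holds one value per row; it does not look values up by the contents of another column.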
sparrow
  • 6
    `pykernel_launcher.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy """Entry point for launching an IPython kernel.` – franchb Feb 01 '18 at 15:54
  • @sparrow will using `pd.Series` affect the dtype? I mean, will it leave floats as floats and strings as strings? Or will the elements within the list default to strings? – 3kstc Feb 27 '18 at 03:23
  • 2
    @IlyaRusin, it's a false positive which can be ignored in this case. For more info: https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas – sparrow Aug 14 '18 at 21:47
  • 2
    This can be simplified to: df['new_col'] = pd.Series(mylist).values – smartse Nov 05 '18 at 19:00
61

IIUC, if you make your (unfortunately named) List into an ndarray, you can simply index into it naturally.

>>> import numpy as np
>>> m = np.arange(16)*10
>>> m[df.A]
array([  0,  40,  50,  60, 150, 150, 140, 130])
>>> df["D"] = m[df.A]
>>> df
    A   B   C    D
0   0 NaN NaN    0
1   4 NaN NaN   40
2   5 NaN NaN   50
3   6 NaN NaN   60
4  15 NaN NaN  150
5  15 NaN NaN  150
6  14 NaN NaN  140
7  13 NaN NaN  130

Here I built a new m, but if you use m = np.asarray(List), the same thing should work: the values in df.A will pick out the appropriate elements of m.


Note that if you're using an old version of numpy, you might have to use m[df.A.values] instead. In the past, numpy didn't play well with others, and some refactoring in pandas caused some headaches. Things have improved now.
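
For reference, here is a short sketch of the same idea applied to the data from the question. The variable name `lookup` is just an illustrative choice, and `to_numpy()` assumes a reasonably recent pandas (on older versions, `.values` plays the same role):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 4, 5, 6, 7, 7, 6, 5]})
List = [2, 5, 6, 8, 12, 16, 26, 32]  # name kept from the question

# Turn the list into an array so it can be indexed by a whole column at once.
lookup = np.asarray(List)

# Each value n in column A picks out lookup[n]; no Python-level loop is needed.
df["D"] = lookup[df["A"].to_numpy()]
```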

edge-case
DSM
  • Hi @DSM. I get what you are saying but I am getting this error: `Traceback (most recent call last):` `File "./b.py", line 24, in ` `d["D"] = m[d.A]` `IndexError: unsupported iterator index` – mane Oct 31 '14 at 03:44
  • 1
    @mane: urf, that's an old `numpy` bug. Does `d["D"] = m[d.A.values]` work for you? – DSM Oct 31 '14 at 03:51
21

A solution improving on the great one from @sparrow.

Let df be your dataset, and mylist the list of values you want to add to the dataframe.

Let's suppose you want to call your new column simply new_column.

First make the list into a Series:

column_values = pd.Series(mylist)

Then use the insert function to add the column. This function has the advantage of letting you choose the position of the new column. In the following example, we place the new column first from the left (by setting loc=0):

df.insert(loc=0, column='new_column', value=column_values)
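
One thing to watch for: when the value is a Series, pandas aligns it with the dataframe on index labels rather than on position. A small sketch of that behaviour, using a hypothetical string index and made-up column names:

```python
import pandas as pd

# Hypothetical dataframe whose index is not the default 0, 1, 2, ...
df = pd.DataFrame({"A": [0, 4, 5]}, index=["x", "y", "z"])
mylist = [2, 12, 16]

# A default-indexed Series shares no labels with df, so alignment yields NaN.
df.insert(loc=0, column="misaligned", value=pd.Series(mylist))

# Passing the plain list (or an array) is positional, so no alignment happens.
df.insert(loc=0, column="new_column", value=mylist)
```

If you need to keep the Series, setting its index to `df.index` before inserting has the same effect as passing the raw values.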
erip
Salvatore Cosentino
  • This will not work if you changed the index of df to something other than the default 0, 1, 2, ... In that case you have to add this between the lines: column_values.index = df.index – Guy s Mar 16 '19 at 17:47
9

Old question, but I always try to use the fastest code!

I had a huge list of 69 million uint64 values. np.array() was fastest for me.

df['hashes'] = hashes
Time spent: 17.034842014312744

df['hashes'] = pd.Series(hashes).values
Time spent: 17.141014337539673

df['key'] = np.array(hashes)
Time spent: 10.724546194076538
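
If you want to check this on your own machine, a rough timing harness might look like the sketch below. The helper `timed`, the column names, and the stand-in data are all assumptions made for illustration, and the list is much smaller than the one described above, so the absolute numbers will differ:

```python
import time

import numpy as np
import pandas as pd

n = 100_000  # far smaller than the 69 million values mentioned above
hashes = list(range(n))  # stand-in data, not real hashes
df = pd.DataFrame(index=range(n))


def timed(label, make_values):
    """Assign make_values() to df[label] and return the elapsed seconds."""
    start = time.perf_counter()
    df[label] = make_values()
    return time.perf_counter() - start


timings = {
    "list": timed("from_list", lambda: hashes),
    "series_values": timed("from_series", lambda: pd.Series(hashes).values),
    "ndarray": timed("from_array", lambda: np.array(hashes)),
}
```

Results depend on the data type, the list size, and the library versions, so treat any single measurement as indicative only.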
Mehdi
8

First, let's create the dataframe you had. I'll ignore columns B and C, as they are not relevant.

import pandas as pd
df = pd.DataFrame({'A': [0, 4, 5, 6, 7, 7, 6, 5]})

And the mapping that you desire:

mapping = dict(enumerate([2,5,6,8,12,16,26,32]))

df['D'] = df['A'].map(mapping)

Done!

print(df)

Output:

   A   D
0  0   2
1  4  12
2  5  16
3  6  26
4  7  32
5  7  32
6  6  26
7  5  16
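
One practical difference from plain array indexing is how values outside the expected range are handled. The sketch below assumes a deliberately invalid value (9) in column A to show it: `map` with a dict gives NaN for missing keys, while indexing a numpy array raises an IndexError.

```python
import numpy as np
import pandas as pd

List = [2, 5, 6, 8, 12, 16, 26, 32]  # name kept from the question
df = pd.DataFrame({"A": [0, 7, 9]})  # 9 is deliberately out of range

# Keys missing from the dict become NaN instead of raising.
mapped = df["A"].map(dict(enumerate(List)))

# The same lookup through an array fails loudly on the bad index.
try:
    np.asarray(List)[df["A"].to_numpy()]
    raised = False
except IndexError:
    raised = True
```

Which behaviour is preferable depends on whether a bad value in A should be treated as missing data or as an error.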
Toby Seo
Phil Cooper
  • 1
    I think the OP knows how to do this already. By my reading the issue is constructing `D` from the elements of `A` and `List` ("If the element in column A is n, I need to insert the n th element from the List in a new column, say 'D'.") – DSM Oct 31 '14 at 03:39
  • SO has turned into some kind of F(*& nanny state. Thanks to @DSM for the comment, but I couldn't correct the post until it was peer reviewed. And then it was rejected because it was too fast. And then I was able to peer review my own edit. And then it's too late because a worse (IMHO) answer was "accepted". SO has really got some meta-nannies who are less than helpful!!!! – Phil Cooper Oct 31 '14 at 04:01
  • Well, I can't speak for the nannies, but you'll find that your approach is about an order of magnitude slower on long arrays. In other respects, of course, choosing between `np.array(List)[df.A]` and `df["A"].map(dict(enumerate(List)))` is mostly a matter of preference. – DSM Oct 31 '14 at 04:11
  • Hi Phil, I only saw your solution and DSM's comment and then never got back to it since DSM's solution worked fine for me. But now looking at your solution, it works too. I have run DSM's solution on my dataset of about 200k entries and it runs in a couple of seconds with all the other calculations that I have. I am totally new to python-pandas and personally was not looking for anything elegant or great; whatever worked was fine. But honestly, thanks for the solution. – mane Oct 31 '14 at 05:31
6

You can also use df.assign:

In [1559]: df
Out[1559]: 
   A   B   C
0  0 NaN NaN
1  4 NaN NaN
2  5 NaN NaN
3  6 NaN NaN
4  7 NaN NaN
5  7 NaN NaN
6  6 NaN NaN
7  5 NaN NaN

In [1560]: mylist = [2,5,6,8,12,16,26,32]

In [1567]: df = df.assign(D=mylist)

In [1568]: df
Out[1568]: 
   A   B   C   D
0  0 NaN NaN   2
1  4 NaN NaN   5
2  5 NaN NaN   6
3  6 NaN NaN   8
4  7 NaN NaN  12
5  7 NaN NaN  16
6  6 NaN NaN  26
7  5 NaN NaN  32
Mayank Porwal