12

I'm working with Pandas and I have a data frame where we can have one of three values populated:

ID_1    ID_2    ID_3
abc     NaN     NaN
NaN     def     NaN
NaN     NaN     ghi
NaN     NaN     jkl
NaN     mno     NaN
pqr     NaN     NaN

And my goal is to combine these three columns into a new columns in my data frame:

ID_1    ID_2    ID_3  Combined_ID
abc     NaN     NaN    abc
NaN     def     NaN    def
NaN     NaN     ghi    ghi
NaN     NaN     jkl    jkl
NaN     mno     NaN    mno
pqr     NaN     NaN    pqr

Ideally it would just find whatever not null value exists in columns 1 through 3, but I could also concatenate since we should only have one of the three populated for each row. Thanks.

df_note = pd.read_csv("NoteIds.csv")
df_note['Combined_ID'] = # ID_1 + ID_2 + ID_3
EMC
  • 699
  • 1
  • 9
  • 16
  • 2
    Are we certain that there's only one non-NaN value per row? – DSM Apr 21 '15 at 20:56
  • @EdChum - There are actually other columns as well. So yes, only one non NaN but only for those 3 columns, not the entire row. – EMC Apr 21 '15 at 21:07

3 Answers3

13

You can use the property that summing will concatenate the string values, so you could call fillna and pass an empty str and the call sum and pass param axis=1 to sum row-wise:

In [26]:

df['Combined_ID'] = df.fillna('').sum(axis=1)
df
Out[26]:
  ID_1 ID_2 ID_3 Combined_ID
0  abc  NaN  NaN         abc
1  NaN  def  NaN         def
2  NaN  NaN  ghi         ghi
3  NaN  NaN  jkl         jkl
4  NaN  mno  NaN         mno
5  pqr  NaN  NaN         pqr

If you're only interested in those 3 columns you can just select them:

In [39]:

df['Combined_ID'] = df[['ID_1','ID_2','ID_3']].fillna('').sum(axis=1)
df
Out[39]:
  ID_1 ID_2 ID_3 Combined_ID
0  abc  NaN  NaN         abc
1  NaN  def  NaN         def
2  NaN  NaN  ghi         ghi
3  NaN  NaN  jkl         jkl
4  NaN  mno  NaN         mno
5  pqr  NaN  NaN         pqr
EdChum
  • 376,765
  • 198
  • 813
  • 562
3

Let's assume that there can be more than one non-NaN value per row. Still this should work.

In [43]: df['Combined_ID'] = df.apply(
                lambda x : ''.join([e for e in x if isinstance(e, basestring)]),
                      axis=1)

For each row, extract string items and join them.

In [44]: df
Out[44]:
  ID_1 ID_2 ID_3 Combined_ID
0  abc  NaN  NaN         abc
1  NaN  def  NaN         def
2  NaN  NaN  ghi         ghi
3  NaN  NaN  jkl         jkl
4  NaN  mno  NaN         mno
5  pqr  NaN  NaN         pqr

I liked @EdChum's answer and looks more readable.

Interestingly, fillna('').sum(axis=1) method is expensive for this smaller data.

In [45]: %timeit df.fillna('').sum(axis=1)
1000 loops, best of 3: 808 µs per loop

In [46]: %timeit df.apply(lambda x : ''.join([e for e in x if isinstance(e, basestring)]), axis=1)
1000 loops, best of 3: 285 µs per loop

For, ['ID_1','ID_2','ID_3'] columns only

df[['ID_1','ID_2','ID_3']].apply(lambda_function)
Zero
  • 74,117
  • 18
  • 147
  • 154
  • That is not so surprising but `apply` does not scale well, I just did timings on a 600 row df and the timings were 6.24ms vs 33.3ms comparing my method against yours, I expect the performance difference to increase significantly on much larger datasets – EdChum Apr 21 '15 at 21:16
  • Is there a way to specify only the 3 columns I need? This is getting the other non-related columns included. – EMC Apr 21 '15 at 21:27
  • @EdChum Absolutely, `apply` doesn't perform well on larger datasets. So, I mentioned *expensive for this smaller data.* =) – Zero Apr 21 '15 at 21:32
  • 2
    A faster way might be to `.ffill()` or `.bfill()` along the rows, and take the last / first row's value. That'd handle multiple non-NaN values reasonably fast. – S Anand Apr 22 '15 at 02:38
  • @SAnand like `df.ffill(axis=1).iloc[:,-1]` – piroot Oct 04 '17 at 12:29
0

Another way:

df['Combined_ID'] = df.ID_1.fillna('') + df.ID_2.fillna('') + df.ID_3.fillna('')

Output:

  ID_1 ID_2 ID_3 Combined_ID
0  abc  NaN  NaN         abc
1  NaN  def  NaN         def
2  NaN  NaN  ghi         ghi
3  NaN  NaN  jkl         jkl
4  NaN  mno  NaN         mno
5  pqr  NaN  NaN         pqr
Joe
  • 12,057
  • 5
  • 39
  • 55