1

Please be advised, I am a beginning programmer and a beginning python/pandas user. I'm a behavioral scientist and learning to use pandas to process and organize my data. As a result, some of this might seem completely obvious and it may seem like a question not worthy of the forum. Please have tolerance! To me, this is days of work, and I have indeed spent hours trying to figure out the answer to this question already. Thanks in advance for any help.

My data look like this. The "real" Actor and Recipient data are always 5-digit numbers, and the "Behavior" data are always letter codes. My problem is that I also use this format for special lines, denoted by markers like "date" or "s" in the Actor column. These markers indicate that the "Behavior" column holds this special type of data, and not actual Behavior data. So, I want to replace the markers in the Actor column with NaN values, and grab the special data from the behavior column to put in another column (in this example, the empty Activity column).

    follow    Activity    Actor    Behavior    Recipient1
0   1         NaN         date     2.1.3.2012  NaN
1   1         NaN         s        ss.hx       NaN
2   1         NaN         50505    vo          51608
3   1         NaN         51608    vr          50505
4   1         NaN         s        ss.he       NaN

So far, I have written some code in pandas to select out the "s" lines into a new dataframe:

def get_act_line(group):
    return group.ix[(group.Actor == 's')]

result = trimdata.groupby('follow').apply(get_act_line)

I've copied over the Behavior column in this dataframe to the Activity column, and replaced the Actor and Behavior values with NaN:

result.Activity = result.Behavior
result.Behavior = np.nan
result.Actor = np.nan
result.head()

So my new dataframe looks like this:

follow         follow    Activity    Actor    Behavior    Recipient1
1        2     1         ss.hx       NaN      NaN         NaN
         34    1         hf.xa       NaN      NaN         f.53702
         74    1         hf.fe       NaN      NaN         NaN
10       1287  10        ss.hf       NaN      NaN         db
         1335  10        fe          NaN      NaN         db

What I would like to do now is to combine this dataframe with the original, replacing all of the values in these selected rows, but maintaining values for the other rows in the original dataframe.

This may seem like a simple question with an obvious solution, or perhaps I have gone about it all wrong to begin with!

I've worked through Wes McKinney's book, I've read the documentation on different types of merges, mapping, joining, transformations, concatenations, etc. I have browsed the forums and have not found an answer that helps me to figure this out. Your help will be very much appreciated.

M.A.Kline
  • 1,697
  • 2
  • 19
  • 29

1 Answers1

2

One way you can do this (though there may be more optimal or elegant ways) is:

mask = (df['Actor']=='s')
df['Activity'] = df[mask]['Behavior']
df.ix[mask, 'Behavior'] = np.nan 

where df is equivalent to your results dataframe. This should return (my column orders are slightly different):

  Activity  Actor             Behavior  Recipient1  follow
0      NaN   date  2013-04-01 00:00:00          NaN       1
1    ss.hx    NaN                ss.hx          NaN       1
2      NaN  50505                   vo        51608       1
3      NaN  51608                   vr        50505       1
4    ss.he    NaN                ss.hx          NaN       1

References:

  • Explanation of df.ix from other STO post.
Community
  • 1
  • 1
mlimb
  • 2,083
  • 2
  • 14
  • 9