0

I'm doing some study on the below df

timestamp   conversationId   UserId  MessageId      tpMessage   Message 
1614578324  ceb9004ae9d3    1c376ef 5bbd34859329    question    Where do you live?
1614578881  ceb9004ae9d3    1c376ef d3b5d3884152    answer      Brooklyn
1614583764  ceb9004ae9d3    1c376ef 0e4501fcd61f    question    What's your name?
1614590885  ceb9004ae9d3    1c376ef 97d841b79ff7    answer      Phill
1614594952  ceb9004ae9d3    1c376ef 11ed3fd24767    question    What's your gender?
1614602036  ceb9004ae9d3    1c376ef 601538860004    answer      Male
1614602581  ceb9004ae9d3    1c376ef 8bc8d9089609    question    How old are you?
1614606219  ceb9004ae9d3    1c376ef a2bd45e64b7c    answer      35
1614606240  loi90zj8q0qv    1c890r9 o2bd10ex4b8u    question    Where do you live?
1614606240  jto9034pe0i5    1c489rl o6bd35e64b5j    question    What's your name?
1614606250  jto9034pe0i5    1c489rl 96jd89i55b72    answer      Robert
1614606267  jto9034pe0i5    1c489rl 33yd1445d6ut    answer      Brandom
1614606267  loi90zj8q0qv    1c890r9 o2bd10ex4b8u    answer      London
1614606287  jto9034pe0i5    1c489rl b7q489iae77t    answer      Connor

I need to "split" the timestamp column in 2 based on the tpMessage column, the contidions are:

df['ts_question'] = np.where(df['tpMessage']=='question', df['timestamp'],0)
df['ts_answer'] = np.where(df['tpMessage']=='answer', df['timestamp'],0)

this is giving me "0" values for both columns when the conditions don't match and I'm stuck in how to move forward after that

my goal is to get this output:

ts_question ts_answer   conversationId   UserId
1614578324  1614578881  ceb9004ae9d3    1c376ef
1614583764  1614590885  ceb9004ae9d3    1c376ef
1614594952  1614602036  ceb9004ae9d3    1c376ef
1614602581  1614606219  ceb9004ae9d3    1c376ef
1614606240  1614606250  jto9034pe0i5    1c489rl
1614606240  1614606267  o2bd10ex4b8u    1c890r9
1614606240  1614606267  o2bd10ex4b8u    1c489rl
1614606240  1614606287  jto9034pe0i5    1c489rl

note that I can have 1 or more answers for the question "What's your name"?

Edit : I found out that I can have N conversations happening at the same timestamp(i.e. 1614606240 and 1614606267)

could you guys help me on that

gfernandes
  • 193
  • 1
  • 10
  • You can use the apply function and pass it a lambda function which gets the row as an argument. See [here](https://stackoverflow.com/questions/26886653/create-new-column-based-on-values-from-other-columns-apply-a-function-of-multi) – Snake_py Feb 05 '23 at 14:40

1 Answers1

0

You can use merge:

# Assuming dataframe is already sorted by timestamp)
df['thread'] = df['tpMessage'].eq('question').cumsum()

# Split your data in two new dataframes: questions and answers
dfq = df[df['tpMessage'] == 'question'].rename(columns={'timestamp': 'ts_question'})
dfa = df[df['tpMessage'] == 'answer'].rename(columns={'timestamp': 'ts_answer'})

# Merge them on conversation, user id and thread
cols = ['ts_question', 'ts_answer', 'conversationId', 'UserId']
out = dfa.merge(dfq, on=['conversationId', 'UserId', 'thread'], how='outer')[cols]

Output:

>>> out
   ts_question   ts_answer conversationId   UserId
0   1614578324  1614578881   ceb9004ae9d3  1c376ef
1   1614583764  1614590885   ceb9004ae9d3  1c376ef
2   1614594952  1614602036   ceb9004ae9d3  1c376ef
3   1614602581  1614606219   ceb9004ae9d3  1c376ef
4   1614606240  1614606250   jto9034pe0i5  1c489rl
5   1614606240  1614606267   jto9034pe0i5  1c489rl
6   1614606240  1614606287   jto9034pe0i5  1c489rl
Corralien
  • 109,409
  • 8
  • 28
  • 52
  • hey @Corralien, thanks for you reply, I tried your code, but I got the Nan values for ts_question(when the row was an answer) and Nan values for ts_answer(when the row was a question. the sort by timestamp you meant, I sorted using this: df2.sort_values(by='timestamp', ascending=True, inplace=True) is that what you were assuming? – gfernandes Feb 05 '23 at 17:10
  • Yes, this is what I meant. Did you split your dataframe into two parts when I did. – Corralien Feb 05 '23 at 17:43
  • yes, one for answers and other for questions. The df_thread column, it won't match for dfa and dfq, so, what is the idea about it? – gfernandes Feb 05 '23 at 17:55
  • Hey Corralien, I got the issue, I can have in my dataset N conversations happening at the same timestamp, I'm going to edit the question to provide this info to see if you can think ok something considering this new scenario – gfernandes Feb 05 '23 at 18:57