2

I have a dataframe with a bunch of Q&A sessions. Each time the speaker changes, the dataframe has a new row. I'm trying to assign question characteristics to the answers so I want to create an ID for each question-answer group. In the example below, I want to increment the id each time a new question is asked (speakertype_id == 3 => questions; speakertype_id == 4 => answers). I currently loop through the dataframe like so:

Q_A = pd.DataFrame({'qna_id':[9]*10,
                    'qnacomponentid':[3,4,5,6,7,8,9,10,11,12],
                    'speakertype_id':[3,4,3,4,4,4,3,4,3,4]})


group = [0]*len(Q_A)
j = 1
for index,row in enumerate(Q_A.itertuples()):
    if row[3] == 3: 
        j+=1
    group[index] = j

Q_A['group'] = group

This gives me the desired output and is much faster than I expected, but this post makes me question whether I should ever iterate over a pandas dataframe. Any thoughts on a better method? Thanks.

**Edit: Expected Output:

qna_id  qnacomponentid  speakertype_id  group
9   3   3   2
9   4   4   2
9   5   3   3
9   6   4   3
9   7   4   3
9   8   4   3
9   9   3   4
9   10  4   4
9   11  3   5
9   12  4   5
J4FFLE
  • 172
  • 10

2 Answers2

2

you can use eq and cumsum like:

Q_A['gr2'] = Q_A['speakertype_id'].eq(3).cumsum()
print(Q_A)
   qna_id  qnacomponentid  speakertype_id  group  gr2
0       9               3               3      2    1
1       9               4               4      2    1
2       9               5               3      3    2
3       9               6               4      3    2
4       9               7               4      3    2
5       9               8               4      3    2
6       9               9               3      4    3
7       9              10               4      4    3
8       9              11               3      5    4
9       9              12               4      5    4

Note that not sure if you have any reason to start at 2, but you can add +1 after the cumsum if it is a requirement

Ben.T
  • 29,160
  • 6
  • 32
  • 54
  • 1
    This is perfect. I really have no reason to start at 2 except that is where it was when I got a working version that I liked so I kept it as it wasn't a big deal. The ID's don't even have to be numbers or consecutive. That just seemed the easiest approach. Thank you! – J4FFLE Jun 16 '21 at 17:38
2

i reproduced as per your output:

Q_A['cumsum'] = Q_A[Q_A.speakertype_id!=Q_A.speakertype_id.shift()].groupby('speakertype_id').cumcount()+2
Q_A['cumsum'] = Q_A['cumsum'].ffill().astype('int')
Suraj Shejal
  • 640
  • 3
  • 19
  • Thank you!! This is great if I don't want to increment when there are two consecutive questions with no answers. I believe in my case, I will want to increment so the unanswered question is left in a group by itself. The questions are highly moderated, so only one individual asks a question at a time. – J4FFLE Jun 16 '21 at 17:36