This is my first question on Stack Overflow.
I run into this problem quite often.
Here is some code to extract the indices of the <eos> tags:
eos_lst = concat_df.index[concat_df['Words'] == '<eos>'].tolist()
eos_dict = dict(enumerate(eos_lst))
{0: 19,
1: 43,
2: 66,
3: 89,
4: 109,
5: 133,
6: 155,
7: 177,
8: 200,
9: 222,
10: 242,
11: 263,
12: 284,
13: 307,
14: 330,
15: 351,
16: 374,
17: 397,
18: 421,
19: 445,
20: 467,
21: 489,
22: 515,
23: 537}
Here is what my DataFrame looks like:
Surprisal Words
0 13.662818 mechanic
1 6.629755 that
2 2.837583 <unk>
3 7.545498 hired
4 7.283582 this
5 5.102878 year
6 15.049909 wondered
7 2.247853 how
8 11.838453 annoyed
9 4.648082 with
10 7.423959 himself
11 5.600996 for
12 8.367517 breaking
13 1.532452 the
14 5.288836 car
15 6.597129 the
16 11.746726 assistant
17 3.036219 would
18 5.600145 <unk>
19 17.821831 <eos>
20 6.084546 The
21 11.514463 plumber
22 5.786572 that
23 3.070432 <unk>
24 7.995253 hired
25 4.691128 on
26 10.046485 Monday
27 12.741153 considered
28 5.559496 how
29 11.221653 annoyed
30 5.540303 with
31 7.450393 himself
32 5.266942 for
33 10.060509 lying
34 3.438006 about
35 1.230741 the
36 12.262648 pipes
37 6.790797 the
38 6.592342 real
39 5.250412 estate
40 3.185315 agent
41 10.320124 probably
42 5.210815 <unk>
43 18.428978 <eos>
44 5.514643 The
45 11.064813 fashion
46 6.150403 model
47 4.415222 that
....
66 18.096354 <eos>
67 6.205741 The
68 11.301127 carpenter
69 6.170512 that
Here are two possible expected outputs.
The first sample output would be much more convenient for me. I include the second one because other people might find it useful.
Surprisal Words Sentence_id
0 13.662818 mechanic 0
1 6.629755 that 0
2 2.837583 <unk> 0
3 7.545498 hired 0
4 7.283582 this 0
5 5.102878 year 0
6 15.049909 wondered 0
7 2.247853 how 0
8 11.838453 annoyed 0
9 4.648082 with 0
10 7.423959 himself 0
11 5.600996 for 0
12 8.367517 breaking 0
13 1.532452 the 0
14 5.288836 car 0
15 6.597129 the 0
16 11.746726 assistant 0
17 3.036219 would 0
18 5.600145 <unk> 0
19 17.821831 <eos> 0
20 6.084546 The 1
21 11.514463 plumber 1
22 5.786572 that 1
23 3.070432 <unk> 1
24 7.995253 hired 1
25 4.691128 on 1
26 10.046485 Monday 1
27 12.741153 considered 1
28 5.559496 how 1
29 11.221653 annoyed 1
30 5.540303 with 1
31 7.450393 himself 1
32 5.266942 for 1
33 10.060509 lying 1
34 3.438006 about 1
35 1.230741 the 1
36 12.262648 pipes 1
37 6.790797 the 1
38 6.592342 real 1
39 5.250412 estate 1
40 3.185315 agent 1
41 10.320124 probably 1
42 5.210815 <unk> 1
43 18.428978 <eos> 1
44 5.514643 The 2
45 11.064813 fashion 2
46 6.150403 model 2
47 4.415222 that 2
....
66 18.096354 <eos> 2
67 6.205741 The 3
68 11.301127 carpenter 3
69 6.170512 that 3
Second sample output:
0 13.662818 mechanic
1 6.629755 that
Sentence_id_1 2 2.837583 <unk>
3 7.545498 hired
4 7.283582 this
5 5.102878 year
...
19 17.821831 <eos>
20 6.084546 The
21 11.514463 plumber
22 5.786572 that
Sentence_id_2 23 3.070432 <unk>
24 7.995253 hired
25 4.691128 on
26 10.046485 Monday
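For the second layout, one option might be to move the sentence id into the index so that the rows are grouped under it. This is only a minimal sketch: the toy frame is made up, the Sentence_id column is hard-coded for illustration, and plain integer ids are used instead of labels like Sentence_id_1.

```python
import pandas as pd

# Hypothetical toy stand-in for concat_df, with Sentence_id already present
concat_df = pd.DataFrame({
    "Surprisal": [13.66, 6.63, 17.82, 6.08, 11.51, 18.43],
    "Words": ["mechanic", "that", "<eos>", "The", "plumber", "<eos>"],
    "Sentence_id": [0, 0, 0, 1, 1, 1],
})

# Append Sentence_id to the existing index, then make it the outer level
grouped = concat_df.set_index("Sentence_id", append=True).swaplevel(0, 1)

# All rows of one sentence can then be selected by its id
second_sentence = grouped.loc[1]
print(grouped)
```

With this layout, grouped.loc[k] returns the rows of sentence k while keeping the original row numbers as the inner index level.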
So, what I want to do is create a new column named sentence_id, filled with the sentence ids (basically the keys of eos_dict), where each id runs up to and including the row of the corresponding <eos> tag. As you can see, I have already extracted the indices of the <eos> (i.e., end of sentence) tags (see eos_dict). For example, I would like to fill rows 0 through 19 of the new column with zeros. Then, for the next sentence id, I have to go to row 20 and fill all the values with ones up to and including row 43. Then rows 44 through 66 get twos, and so on.
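The filling rule described above amounts to counting how many <eos> rows come before each row. A minimal sketch of that idea, assuming a made-up toy frame in place of the real concat_df:

```python
import pandas as pd

# Hypothetical toy stand-in for concat_df
concat_df = pd.DataFrame({
    "Surprisal": [13.66, 6.63, 17.82, 6.08, 11.51, 18.43, 5.51],
    "Words": ["mechanic", "that", "<eos>", "The", "plumber", "<eos>", "The"],
})

# True on every end-of-sentence row
is_eos = concat_df["Words"].eq("<eos>")

# Shift by one row so each <eos> stays in its own sentence, then take the
# running count of earlier <eos> rows as the sentence id
concat_df["sentence_id"] = is_eos.shift(fill_value=False).cumsum()
print(concat_df)
```

This does not need eos_dict at all, and it does not depend on the index labels, only on the row order.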
So this is what I have tried to do:
for k, v in eos_dict.items():
    concat_df['sentence_id'] = concat_df[:v].apply(lambda x: k, axis=1)
I know that this code doesn't work, because I also have to increment the values of the dictionary by 1 so that the next sentence_id starts at the correct row. In addition, the rows should be filled with the keys while the values are being incremented by 1.
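Another thing that appears to go wrong in the loop is that every iteration assigns the whole sentence_id column again, so earlier ids are overwritten. If the already extracted eos_lst is to be reused, one possible loop-free sketch is a sorted lookup with NumPy. It assumes the default integer index in ascending order, and the toy frame is made up:

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-in for concat_df
concat_df = pd.DataFrame({
    "Surprisal": [13.66, 6.63, 17.82, 6.08, 11.51, 18.43, 5.51],
    "Words": ["mechanic", "that", "<eos>", "The", "plumber", "<eos>", "The"],
})

# Same extraction as in the question; here it yields [2, 5]
eos_lst = concat_df.index[concat_df["Words"] == "<eos>"].tolist()

# For each row label, count the <eos> positions strictly before it.
# side="left" keeps the <eos> row itself in the sentence it closes.
concat_df["sentence_id"] = np.searchsorted(
    eos_lst, concat_df.index.to_numpy(), side="left"
)
print(concat_df)
```

Rows after the last <eos> get the next id (2 in this toy example), which matches how the trailing rows are numbered in the first sample output.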
I would appreciate a solution that uses built-in pandas functionality.