
So, this is my first question on Stack Overflow.

I run into this problem quite often, by the way.

Here is some code to extract the indices of the <eos> tags:

eos_lst = concat_df.index[concat_df['Words'] == '<eos>'].tolist()
eos_dict = dict(enumerate(eos_lst))

{0: 19,
 1: 43,
 2: 66,
 3: 89,
 4: 109,
 5: 133,
 6: 155,
 7: 177,
 8: 200,
 9: 222,
 10: 242,
 11: 263,
 12: 284,
 13: 307,
 14: 330,
 15: 351,
 16: 374,
 17: 397,
 18: 421,
 19: 445,
 20: 467,
 21: 489,
 22: 515,
 23: 537} 

Here is what my dataframe looks like:

    Surprisal    Words    
0   13.662818   mechanic     
1   6.629755    that         
2   2.837583    <unk>        
3   7.545498    hired        
4   7.283582    this         
5   5.102878    year         
6   15.049909   wondered     
7   2.247853    how          
8   11.838453   annoyed      
9   4.648082    with   
10  7.423959    himself
11  5.600996    for
12  8.367517    breaking
13  1.532452    the
14  5.288836    car
15  6.597129    the
16  11.746726   assistant
17  3.036219    would
18  5.600145    <unk>
19  17.821831   <eos>
20  6.084546    The
21  11.514463   plumber
22  5.786572    that
23  3.070432    <unk>
24  7.995253    hired
25  4.691128    on
26  10.046485   Monday
27  12.741153   considered
28  5.559496    how
29  11.221653   annoyed
30  5.540303    with
31  7.450393    himself
32  5.266942    for
33  10.060509   lying
34  3.438006    about
35  1.230741    the
36  12.262648   pipes
37  6.790797    the
38  6.592342    real
39  5.250412    estate
40  3.185315    agent
41  10.320124   probably
42  5.210815    <unk>
43  18.428978   <eos>
44  5.514643    The         
45  11.064813   fashion       
46  6.150403    model       
47  4.415222    that        
....    
66  18.096354   <eos>       
67  6.205741    The         
68  11.301127   carpenter   
69  6.170512    that        

Here are two possible expected outputs.
Personally, I would find the first one much more convenient, but I am including the second one as well because other people might find it useful.

    Surprisal   Words   Sentence_id
0   13.662818   mechanic    0
1   6.629755    that        0
2   2.837583    <unk>       0
3   7.545498    hired       0
4   7.283582    this        0
5   5.102878    year        0
6   15.049909   wondered    0
7   2.247853    how         0
8   11.838453   annoyed     0
9   4.648082    with        0
10  7.423959    himself     0
11  5.600996    for         0
12  8.367517    breaking    0
13  1.532452    the         0
14  5.288836    car         0
15  6.597129    the         0
16  11.746726   assistant   0
17  3.036219    would       0
18  5.600145    <unk>       0
19  17.821831   <eos>       0
20  6.084546    The         1
21  11.514463   plumber     1
22  5.786572    that        1
23  3.070432    <unk>       1
24  7.995253    hired       1
25  4.691128    on          1
26  10.046485   Monday      1
27  12.741153   considered  1
28  5.559496    how         1
29  11.221653   annoyed     1
30  5.540303    with        1
31  7.450393    himself     1
32  5.266942    for         1  
33  10.060509   lying       1
34  3.438006    about       1
35  1.230741    the         1
36  12.262648   pipes       1
37  6.790797    the         1
38  6.592342    real        1
39  5.250412    estate      1
40  3.185315    agent       1
41  10.320124   probably    1
42  5.210815    <unk>       1
43  18.428978   <eos>       1
44  5.514643    The         2
45  11.064813   fashion     2  
46  6.150403    model       2
47  4.415222    that        2
....    
66  18.096354   <eos>       2
67  6.205741    The         3
68  11.301127   carpenter   3
69  6.170512    that        3





               0   13.662818   mechanic     
               1   6.629755    that         
Sentence_id_1  2   2.837583    <unk>        
               3   7.545498    hired        
               4   7.283582    this         
               5   5.102878    year
              ...
              19  17.821831   <eos>



                  20  6.084546    The
                  21  11.514463   plumber
                  22  5.786572    that
Sentence_id_2     23  3.070432    <unk>
                  24  7.995253    hired
                  25  4.691128    on
                  26  10.046485   Monday

So, what I want to do is create a new column named sentence_id whose values are the sentence ids (basically the keys of eos_dict), with each id repeated up to and including the row of the corresponding <eos> tag. As you can see, I have already extracted the indices of the <eos> (i.e., end of sentence) tags (see eos_dict). For example, I would like to fill rows 0 through 19 of the new column with zeros. Then, for the next sentence id, I have to go to row 20 and fill all the values with ones up to and including row 43. Then rows 44 through 66 get twos, and so on.
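To make the target concrete, here is a minimal sketch of one way this could be computed, assuming the dataframe has a default integer index and that every sentence ends with exactly one <eos> row. The toy data and the name by_sentence are made up for illustration; only Words, Surprisal, Sentence_id and concat_df come from the question.

```python
import pandas as pd

# Toy stand-in for concat_df (hypothetical values, default RangeIndex).
concat_df = pd.DataFrame({
    "Surprisal": [13.7, 6.6, 17.8, 6.1, 11.5, 18.4, 5.5],
    "Words": ["mechanic", "that", "<eos>", "The", "plumber", "<eos>", "The"],
})

# True on every row that closes a sentence.
is_eos = concat_df["Words"].eq("<eos>")

# cumsum() counts the <eos> tags seen so far, including the current row;
# shifting down by one row keeps each <eos> inside the sentence it ends.
concat_df["Sentence_id"] = is_eos.cumsum().shift(fill_value=0)

# Second layout: Sentence_id as the outer level of a two-level index.
# (by_sentence is a hypothetical name; concat_df itself is left unchanged.)
by_sentence = concat_df.set_index(["Sentence_id", concat_df.index])

print(concat_df)
print(by_sentence)
```

This works directly from the Words column, so the eos_dict lookup is not needed here. Rows after the last <eos> simply get the next id.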

So this is what I have tried to do:

for k, v in eos_dict.items(): 
    concat_df['sentence_id'] = concat_df[:v].apply(lambda x: k, axis=1)

I know that this code doesn't work, because each sentence has to start one row after the previous <eos> index (that is, at the previous dictionary value plus 1) so that the next sentence_id begins at the correct row. On top of that, the rows in each range have to be filled with the corresponding key rather than the value.
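For comparison, a loop along the lines of the attempt above might be repaired roughly as follows. This is only a sketch under the assumption of a default integer index; the toy data and the start variable are hypothetical, and it relies on label-based .loc slices including both endpoints.

```python
import pandas as pd

# Toy stand-in for concat_df (hypothetical values, default RangeIndex).
concat_df = pd.DataFrame({"Words": ["a", "b", "<eos>", "c", "<eos>", "d"]})
eos_lst = concat_df.index[concat_df["Words"] == "<eos>"].tolist()

# Rows after the last <eos> keep this default id.
concat_df["sentence_id"] = len(eos_lst)

start = 0  # first row of the current sentence
for k, v in enumerate(eos_lst):
    # .loc slices by label and includes both ends, so row v (the <eos>) gets id k.
    concat_df.loc[start:v, "sentence_id"] = k
    # The next sentence begins one row after this <eos>.
    start = v + 1

print(concat_df)
```

The loop assigns into a fixed row range on each pass instead of reassigning the whole column, which is what lets earlier sentence ids survive later iterations.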

I would appreciate a solution that uses built-in pandas functionality.

h.nakata
  • kindly share data and not pics or links. share a small sample of ur data, as well as ur expected output. visuals add clarity to the words. data pls. [read this](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – sammywemmy Mar 30 '20 at 02:22
  • Thanks @sammywemmy, I added the sample and expected output to the question. – h.nakata Mar 30 '20 at 14:25

0 Answers