I have data like this:
doc_id sentence_id token_id head_token_id
1 doc1 1 1 0
2 doc1 1 2 1
3 doc1 1 3 1
4 doc1 1 4 3
5 doc1 1 5 4
6 doc1 1 6 1
7 doc1 2 1 2
8 doc1 2 2 0
9 doc1 2 3 2
10 doc1 2 4 3
11 doc1 2 5 2
12 doc2 1 1 0
13 doc2 1 2 1
14 doc2 1 3 4
15 doc2 1 4 1
The data is grouped by the columns "doc_id" and "sentence_id". The "head_token_id" column holds order values, but within a group these values are not always consecutive. For example, for doc_id == "doc1" and sentence_id == 1 the values of "head_token_id" are 0, 1, 1, 3, 4, 1. I want to change them to the consecutive values 0, 1, 1, 2, 3, 1, and I want to do this separately within each group of "doc_id" and "sentence_id".
My desired output has an additional new_head_token_id column. The values 0 and 1 in head_token_id always stay the same. The other values may or may not change, depending on whether every smaller number occurs in head_token_id within the same sentence. For example:
doc_id sentence_id token_id head_token_id new_head_token_id
4 doc1 1 4 3 2
Here the 3 in head_token_id becomes 2, because in this sentence (doc1, sentence 1) the head_token_id column contains no 2. In other words, I am trying to remove the gaps in the numbering. The full desired output:
doc_id sentence_id token_id head_token_id new_head_token_id
1 doc1 1 1 0 0
2 doc1 1 2 1 1
3 doc1 1 3 1 1
4 doc1 1 4 3 2
5 doc1 1 5 4 3
6 doc1 1 6 1 1
7 doc1 2 1 2 1
8 doc1 2 2 0 0
9 doc1 2 3 2 1
10 doc1 2 4 3 2
11 doc1 2 5 2 1
12 doc2 1 1 0 0
13 doc2 1 2 1 1
14 doc2 1 3 4 2
15 doc2 1 4 1 1
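As far as I can tell, the renumbering described above is a dense rank of head_token_id within each doc_id/sentence_id group, shifted so that it starts at 0. Here is a minimal base R sketch of that idea. It assumes the data frame is called df and rebuilds the sample data by hand; dense_rank0 is just a helper name made up for illustration. It reproduces the desired column for this sample (including 2 becoming 1 in doc1, sentence 2), but I have not checked it on anything else:

```r
# Sample data from above, rebuilt by hand (the name df is an assumption)
df <- data.frame(
  doc_id        = c(rep("doc1", 11), rep("doc2", 4)),
  sentence_id   = c(rep(1, 6), rep(2, 5), rep(1, 4)),
  token_id      = c(1:6, 1:5, 1:4),
  head_token_id = c(0, 1, 1, 3, 4, 1,  2, 0, 2, 3, 2,  0, 1, 4, 1)
)

# Hypothetical helper: dense rank starting at 0, i.e. the position of each
# value among the sorted unique values of its group, minus one
dense_rank0 <- function(x) match(x, sort(unique(x))) - 1

# ave() applies the helper within each doc_id/sentence_id group
# and returns the results in the original row order
df$new_head_token_id <- ave(
  df$head_token_id, df$doc_id, df$sentence_id,
  FUN = dense_rank0
)

print(df)
```

If dplyr is an option, I believe dense_rank() inside group_by(doc_id, sentence_id) followed by subtracting 1 would express the same thing.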
I think the first part of the code should look something like this:
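df$new_head_token_id <- NA
for (i in unique(df$doc_id)) {
  for (j in unique(df$sentence_id[df$doc_id == i])) {
    # row indices belonging to this doc_id/sentence_id group
    rows <- which(df$doc_id == i & df$sentence_id == j)
    for (k in rows) {
      # 0 and 1 keep their value; the other values still need to be renumbered
      if (df$head_token_id[k] == 0) {
        df$new_head_token_id[k] <- 0
      } else if (df$head_token_id[k] == 1) {
        df$new_head_token_id[k] <- 1
      }
    }
  }
}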