I have data like this:

   doc_id sentence_id token_id head_token_id
1    doc1           1        1             0
2    doc1           1        2             1
3    doc1           1        3             1
4    doc1           1        4             3
5    doc1           1        5             4
6    doc1           1        6             1
7    doc1           2        1             2
8    doc1           2        2             0
9    doc1           2        3             2
10   doc1           2        4             3
11   doc1           2        5             2
12   doc2           1        1             0
13   doc2           1        2             1
14   doc2           1        3             4
15   doc2           1        4             1

The data is grouped by the columns "doc_id" and "sentence_id". The "head_token_id" column is an order column, but its values are not consecutive. For example, the values of "head_token_id" for doc_id == "doc1" and sentence_id == 1 are 0, 1, 1, 3, 4, 1. I want to change them to the consecutive values 0, 1, 1, 2, 3, 1, and to do this within each group of "doc_id" and "sentence_id".

My desired output adds a new_head_token_id column. The values 0 and 1 in head_token_id always stay the same. Other values may or may not change, depending on whether every smaller number also occurs in the same sentence. For example:

   doc_id sentence_id token_id head_token_id new_head_token_id
4    doc1           1        4             3                 2

Here 3 in head_token_id changes to 2 because the head_token_id column for this sentence (doc1, sentence 1) contains no 2. I am trying to remove these jumps in the numbering. The full desired output is:

   doc_id sentence_id token_id head_token_id new_head_token_id
1    doc1           1        1             0                 0
2    doc1           1        2             1                 1
3    doc1           1        3             1                 1
4    doc1           1        4             3                 2
5    doc1           1        5             4                 3
6    doc1           1        6             1                 1
7    doc1           2        1             2                 1
8    doc1           2        2             0                 0
9    doc1           2        3             2                 1
10   doc1           2        4             3                 2
11   doc1           2        5             2                 1
12   doc2           1        1             0                 0
13   doc2           1        2             1                 1
14   doc2           1        3             4                 2
15   doc2           1        4             1                 1

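I think the first part of the code should be something like this:

df$new_head_token_id <- NA
for (i in unique(df$doc_id)) {
  for (j in unique(df$sentence_id)) {
    # visit only the rows that belong to this document and sentence
    for (k in which(df$doc_id == i & df$sentence_id == j)) {
      if (df$head_token_id[k] == 0) {
        df$new_head_token_id[k] <- 0
      } else if (df$head_token_id[k] == 1) {
        df$new_head_token_id[k] <- 1
      }
    }
  }
}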
little girl

2 Answers

1

This relabeling is pretty easy to do by treating the variable as a factor. We can then coerce it back to numeric. We use the fact that unique() returns the unique values in the order in which they first occur.

The operation we want to do on a vector x is

as.numeric(as.character(
  factor(x, levels = unique(x), labels = seq_along(unique(x)) - 1)
))

This will relabel the unique values of x with the order in which they occur. The -1 makes it start from 0, not 1. And we coerce back to numeric. We'll make this into a function:

label0 = function(x) {
    as.numeric(as.character(
      factor(x, levels = unique(x), labels = seq_along(unique(x)) - 1)
    ))
}
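
As a quick check of what label0 does on a single vector, here is a minimal sketch using two of the head_token_id sequences from the question. Nothing is assumed beyond the function defined above, which is repeated so the snippet runs on its own.

```r
label0 = function(x) {
    as.numeric(as.character(
      factor(x, levels = unique(x), labels = seq_along(unique(x)) - 1)
    ))
}

# doc1, sentence 1: the values first appear in the order 0, 1, 3, 4
s1 <- label0(c(0, 1, 1, 3, 4, 1))
print(s1)  # 0 1 1 2 3 1

# doc1, sentence 2: 2 appears first, so 2 is relabeled 0 and 0 is relabeled 1
s2 <- label0(c(2, 0, 2, 3, 2))
print(s2)  # 0 1 0 2 0
```

The labels follow the order of first appearance, not the numeric order of the values.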

Lastly, pick your favorite method of applying a function by a grouping variable. I'll use dplyr, but you can use data.table, base::ave, base::by, split/lapply/rbind, etc. Examples of these methods and more can be found at the R-FAQ "Sum a variable by group"; you just want to use label0 instead of sum.

library(dplyr)
group_by(dat, doc_id, sentence_id) %>% mutate(new_head_token_id = label0(head_token_id))
# # A tibble: 15 x 5
# # Groups:   doc_id, sentence_id [3]
#    doc_id sentence_id token_id head_token_id new_head_token_id
#    <fctr>       <int>    <int>         <int>             <dbl>
#  1   doc1           1        1             0                 0
#  2   doc1           1        2             1                 1
#  3   doc1           1        3             1                 1
#  4   doc1           1        4             3                 2
#  5   doc1           1        5             4                 3
#  6   doc1           1        6             1                 1
#  7   doc1           2        1             2                 0
#  8   doc1           2        2             0                 1
#  9   doc1           2        3             2                 0
# 10   doc1           2        4             3                 2
# 11   doc1           2        5             2                 0
# 12   doc2           1        1             0                 0
# 13   doc2           1        2             1                 1
# 14   doc2           1        3             4                 2
# 15   doc2           1        4             1                 1

Using this data:

dat = read.table(text = "   doc_id sentence_id token_id head_token_id
1    doc1           1        1             0
2    doc1           1        2             1
3    doc1           1        3             1
4    doc1           1        4             3
5    doc1           1        5             4
6    doc1           1        6             1
7    doc1           2        1             2
8    doc1           2        2             0
9    doc1           2        3             2
10   doc1           2        4             3
11   doc1           2        5             2
12   doc2           1        1             0
13   doc2           1        2             1
14   doc2           1        3             4
15   doc2           1        4             1", header = TRUE)
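
Note that in the output above, rows 7 to 11 (doc1, sentence 2) come out as 0, 1, 0, 2, 0, whereas the desired output in the question is 1, 0, 1, 2, 1. The difference comes from labeling by first appearance. The following is a minimal sketch of a value-ordered variant, applied with base::ave (one of the grouping methods mentioned above). The name label_sorted is hypothetical and not part of the original answer, and the data frame is rebuilt by hand from the question's table.

```r
# Sample data from the question, rebuilt as a data frame
dat <- data.frame(
  doc_id        = rep(c("doc1", "doc2"), times = c(11, 4)),
  sentence_id   = rep(c(1, 2, 1), times = c(6, 5, 4)),
  token_id      = c(1:6, 1:5, 1:4),
  head_token_id = c(0, 1, 1, 3, 4, 1,  2, 0, 2, 3, 2,  0, 1, 4, 1)
)

# Hypothetical variant of label0: position of each value among the
# sorted unique values of its group, shifted to start at 0
label_sorted <- function(x) {
  match(x, sort(unique(x))) - 1
}

# ave() applies the function within each doc_id / sentence_id group
# and returns the results in the original row order
dat$new_head_token_id <- ave(
  dat$head_token_id, dat$doc_id, dat$sentence_id,
  FUN = label_sorted
)
print(dat)
```

With this variant, doc1 sentence 2 becomes 1, 0, 1, 2, 1; the other two sentences are unchanged because their values already first appear in increasing order.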
Gregor Thomas
1

I have an answer for this. Within one document and one sentence, we count how many unique values are lower than the value being checked and replace that value with the count. For example, the values of "head_token_id" for doc_id == "doc1" and sentence_id == 1 are 0, 1, 1, 3, 4, 1. For the value 3, only two unique values are lower (0 and 1), so 3 becomes 2.

Code below:

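# Note: while defined, this function masks base::levels()
levels <- function(parsedDataFrame) {
  parsedDataFrame$head_token_id <- as.numeric(parsedDataFrame$head_token_id)
  # Local accumulator; exists("levelsDF") would also pick up a global levelsDF
  levelsDF <- NULL
  # Columns 1 to 3 are used as the grouping columns (doc, prg, stc).
  # In the sample data from the question column 3 is token_id, so only
  # columns 1 and 2 should be used for grouping there.
  for (doc in unique(parsedDataFrame[, 1])) {
    for (prg in unique(parsedDataFrame[, 2])) {
      for (stc in unique(parsedDataFrame[, 3])) {
        rows <- parsedDataFrame[, 1] == doc & parsedDataFrame[, 2] == prg & parsedDataFrame[, 3] == stc
        if (!any(rows)) next  # skip combinations that do not occur
        newDataFrame <- parsedDataFrame[rows, ]
        # New value = number of unique head_token_id values below the current one
        uniqueHeads <- unique(newDataFrame$head_token_id)
        newDataFrame$sentenceLevel <- sapply(newDataFrame$head_token_id, function(y) sum(y > uniqueHeads))
        levelsDF <- rbind(levelsDF, newDataFrame)
      }
    }
  }
  return(levelsDF)
}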
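
The counting rule described in this answer can be checked on a single sentence without any of the looping. This is a minimal sketch; count_lower is a hypothetical helper name, not part of the function above.

```r
# For each value, count how many unique values in the same vector are smaller
count_lower <- function(x) {
  u <- unique(x)
  vapply(x, function(y) sum(u < y), numeric(1))
}

# doc1, sentence 1: two unique values (0 and 1) lie below 3, so 3 becomes 2
r1 <- count_lower(c(0, 1, 1, 3, 4, 1))
print(r1)  # 0 1 1 2 3 1

# doc1, sentence 2: only 0 lies below 2, so 2 becomes 1
r2 <- count_lower(c(2, 0, 2, 3, 2))
print(r2)  # 1 0 1 2 1
```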
little girl