Change non-consecutive indices to consecutive

Question

I have data like this:

   doc_id sentence_id token_id head_token_id
1    doc1           1        1             0
2    doc1           1        2             1
3    doc1           1        3             1
4    doc1           1        4             3
5    doc1           1        5             4
6    doc1           1        6             1
7    doc1           2        1             2
8    doc1           2        2             0
9    doc1           2        3             2
10   doc1           2        4             3
11   doc1           2        5             2
12   doc2           1        1             0
13   doc2           1        2             1
14   doc2           1        3             4
15   doc2           1        4             1

The data is grouped by the columns "doc_id" and "sentence_id". The "head_token_id" column holds ordering values, but these values are not consecutive. For example, the values of "head_token_id" for doc_id == "doc1" and sentence_id == 1 are 0, 1, 1, 3, 4, 1. I want to change them to the consecutive values 0, 1, 1, 2, 3, 1, and I want to do this within each group of "doc_id" and "sentence_id".

My desired output is shown below, with a new_head_token_id column. The values 0 and 1 in head_token_id always stay the same. The other values may or may not change, depending on whether every smaller number occurs in the same sentence. For example:

   doc_id sentence_id token_id head_token_id new_head_token_id
4    doc1           1        4             3                 2

Here the 3 in head_token_id becomes 2, because in this sentence (doc1, sentence 1) the head_token_id column contains no 2. I am trying to remove these 'jumps' in the numbering.

   doc_id sentence_id token_id head_token_id new_head_token_id
1    doc1           1        1             0                 0
2    doc1           1        2             1                 1
3    doc1           1        3             1                 1
4    doc1           1        4             3                 2
5    doc1           1        5             4                 3
6    doc1           1        6             1                 1
7    doc1           2        1             2                 1
8    doc1           2        2             0                 0
9    doc1           2        3             2                 1
10   doc1           2        4             3                 2
11   doc1           2        5             2                 1
12   doc2           1        1             0                 0
13   doc2           1        2             1                 1
14   doc2           1        3             4                 2
15   doc2           1        4             1                 1

I think the first part of code should be like this

for (i in unique(df$doc_id)){
  for(j in unique(df$sentence_id)){
    for(k in df$token_id){
      if(df$head_token_id[k] == 0){df$new_head_token_id[k] = 0} else
        if(df$head_token_id[k] == 1){df$new_head_token_id[k] = 1}
    }
  }
}

What is the algorithm for this? If you just want to change the order, pass DF a vector of row numbers to reflect this. — Roman Luštrik, Oct 22 '17 at 12:38
I've tried to figure out the algorithm for this, because my real data set is very big, with 100 000 documents — little girl, Oct 22 '17 at 12:44
What I don't understand is, the seventh number in `head_token_id` is a 2 though? — jay.sf, Oct 22 '17 at 12:54
Yes, the seventh number in head_token_id is a 2. The seventh number belongs to the first word of the second sentence in doc1. — little girl, Oct 22 '17 at 14:09

Gregor Thomas · Answer 1 · 2017-10-22T20:59:59.397

This relabeling is pretty easy to do treating the variable as a factor. We can then coerce it back to numeric. We use the fact that unique() will provide the vector of unique values in the order they occur.

The operation we want to do on a vector x is

as.numeric(as.character(
  factor(x, levels = unique(x), labels = seq_along(unique(x)) - 1)
))

This will relabel the unique values of x with the order in which they occur. The -1 makes it start from 0, not 1. And we coerce back to numeric. We'll make this into a function:

label0 = function(x) {
    as.numeric(as.character(
      factor(x, levels = unique(x), labels = seq_along(unique(x)) - 1)
    ))
}
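To see what this relabeling does, here is a small sketch that applies label0 to two of the question's head_token_id vectors. The function is repeated so the snippet runs on its own; the names s1 and s2 are only for illustration. Note that values are numbered by first appearance, not by size, so a sentence that does not start with head 0 gets labels that differ from the sorted order.

```r
# label0 as defined above, repeated so this snippet is self-contained
label0 <- function(x) {
  as.numeric(as.character(
    factor(x, levels = unique(x), labels = seq_along(unique(x)) - 1)
  ))
}

s1 <- c(0, 1, 1, 3, 4, 1)  # doc1, sentence 1
s2 <- c(2, 0, 2, 3, 2)     # doc1, sentence 2

r1 <- label0(s1)  # 0 1 1 2 3 1
r2 <- label0(s2)  # 0 1 0 2 0 (2 appears first, so it is labeled 0)
```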

Lastly, pick your favorite method of applying a function by a grouping variable. I'll use dplyr, but you can use data.table, base::ave, base::by, split; lapply; rbind, etc. Examples of these methods and more can be found at the R-FAQ "Sum a variable by group"; you just want to use label0 instead of sum.

library(dplyr)
group_by(dat, doc_id, sentence_id) %>% mutate(new_head_token_id = label0(head_token_id))
# # A tibble: 15 x 5
# # Groups:   doc_id, sentence_id [3]
#    doc_id sentence_id token_id head_token_id new_head_token_id
#    <fctr>       <int>    <int>         <int>             <dbl>
#  1   doc1           1        1             0                 0
#  2   doc1           1        2             1                 1
#  3   doc1           1        3             1                 1
#  4   doc1           1        4             3                 2
#  5   doc1           1        5             4                 3
#  6   doc1           1        6             1                 1
#  7   doc1           2        1             2                 0
#  8   doc1           2        2             0                 1
#  9   doc1           2        3             2                 0
# 10   doc1           2        4             3                 2
# 11   doc1           2        5             2                 0
# 12   doc2           1        1             0                 0
# 13   doc2           1        2             1                 1
# 14   doc2           1        3             4                 2
# 15   doc2           1        4             1                 1

Using this data:

dat = read.table(text = "   doc_id sentence_id token_id head_token_id
1    doc1           1        1             0
2    doc1           1        2             1
3    doc1           1        3             1
4    doc1           1        4             3
5    doc1           1        5             4
6    doc1           1        6             1
7    doc1           2        1             2
8    doc1           2        2             0
9    doc1           2        3             2
10   doc1           2        4             3
11   doc1           2        5             2
12   doc2           1        1             0
13   doc2           1        2             1
14   doc2           1        3             4
15   doc2           1        4             1", header = TRUE)

doc1,sentence2 and doc2,sentence1 are wrong in new_head_token_id column — little girl, Oct 22 '17 at 16:02
When I used your code on my computer only doc1,sentence2 is wrong. — little girl, Oct 22 '17 at 16:10
Sorry, didn't realize you need grouping by sentence as well. Just add that to the group by: `group_by(dat, doc_id, sentence_id)`, the rest is the same. Answer edited. — Gregor Thomas, Oct 22 '17 at 20:57
still it doesn't work. Look at doc1 sentence2.. 0 and 1 should stay the same always — little girl, Oct 23 '17 at 05:36

score 1 · Accepted Answer · answered Oct 24 '17 at 08:39

I found an answer to this. Within one document and one sentence, we count how many unique values are lower than the value being checked, and replace the value with that count. For example, the values of "head_token_id" for doc_id == "doc1" and sentence_id == 1 are 0, 1, 1, 3, 4, 1. For the value 3, only two unique values are lower (0 and 1), so 3 becomes 2.
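As a minimal sketch of that counting rule on a single sentence (the variable names x, u and new_x are only illustrative):

```r
x <- c(0, 1, 1, 3, 4, 1)  # head_token_id for doc1, sentence 1
u <- unique(x)            # 0 1 3 4

# For each value, count the distinct values that are strictly smaller
new_x <- sapply(x, function(y) sum(u < y))
new_x  # 0 1 1 2 3 1
```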

Code below:

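# Note: defining a function called levels masks base::levels in your workspace.
# Columns 1-3 are assumed to hold the document, paragraph and sentence ids
# (the sample data in the question has no paragraph column).
levels <- function(parsedDataFrame) {
  parsedDataFrame$head_token_id <- as.numeric(parsedDataFrame$head_token_id)
  levelsDF <- NULL  # accumulate locally; exists("levelsDF") could pick up a global object
  for (doc in unique(parsedDataFrame[, 1])) {
    for (prg in unique(parsedDataFrame[, 2])) {
      for (stc in unique(parsedDataFrame[, 3])) {
        newDataFrame <- parsedDataFrame[which(parsedDataFrame[, 1] == doc & parsedDataFrame[, 2] == prg & parsedDataFrame[, 3] == stc), ]
        if (nrow(newDataFrame) == 0) next  # this combination does not occur in the data
        # for each value, the number of distinct head_token_id values below it
        newDataFrame$sentenceLevel <- sapply(newDataFrame$head_token_id, function(y) length(which(y > unique(newDataFrame$head_token_id))))
        levelsDF <- rbind(levelsDF, newDataFrame)
      }
    }
  }
  return(levelsDF)
}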
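For a large data set, the nested loops and repeated rbind calls may be slow. The sketch below is one possible vectorized variant of the same counting idea, not the code from this answer: it uses base::ave to apply a helper within each doc_id/sentence_id group of the question's sample data. The helper name dense0 is hypothetical, and it relies on the fact that the position of a value among the sorted unique values, minus one, equals the number of distinct smaller values.

```r
# Sample data from the question, rebuilt inline
dat <- data.frame(
  doc_id        = rep(c("doc1", "doc1", "doc2"), times = c(6, 5, 4)),
  sentence_id   = rep(c(1, 2, 1), times = c(6, 5, 4)),
  token_id      = c(1:6, 1:5, 1:4),
  head_token_id = c(0, 1, 1, 3, 4, 1,  2, 0, 2, 3, 2,  0, 1, 4, 1)
)

# Hypothetical helper: zero-based rank among the sorted distinct values
dense0 <- function(x) match(x, sort(unique(x))) - 1

# Apply the helper within each (doc_id, sentence_id) group
dat$new_head_token_id <- ave(dat$head_token_id, dat$doc_id, dat$sentence_id,
                             FUN = dense0)
```

On the sample data this gives 1, 0, 1, 2, 1 for doc1, sentence 2, which matches the desired output in the question.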