1

I have a dataset with irregular dates column. I want to create an index column. Index ID (e.g. 1) is same for three dissimilar consecutive dates then changes (e.g. to 2) for next three dissimilar consecutive dates and so on. Here is a sample of dates and how the desired column shall look like:

structure(list(Date = c(42370, 42371, 42371, 42371, 42372, 42372, 
42375, 42375, 42375, 42377, 42377, 42383, 42383, 42385, 42386, 
42386, 42386, 42393, 42393, 42394, 42394, 42395, 42398, 42398, 
42398, 42398), Index = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 
2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4)), .Names = c("Date", 
"Index"), row.names = c(NA, 26L), class = "data.frame")
Jaap
  • 81,064
  • 34
  • 182
  • 193
Polar Bear
  • 731
  • 1
  • 7
  • 21
  • The question might look odd but it's very important for my project. – Polar Bear May 13 '16 at 19:50
  • Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610). This will make it much easier for others to help you. – Jaap May 13 '16 at 19:51
  • Is it fine now? I have attached a file with dates column and desired column( Index). As for the code, I have no clue. – Polar Bear May 13 '16 at 19:55
  • 2
    It is better not to link to an external file, but instead include the output of `dput(yourdataframe)`, see also the second link I provided earlier. – Jaap May 13 '16 at 19:56

4 Answers4

4

Using rleid from the data.table package and cumsum:

library(data.table)
setDT(d1)[, index := (rleid(Date)-1) %% 3
          ][, index := cumsum(index < shift(index, fill=1))][]

gives:

        Date index
 1: 01-01-16     1
 2: 02-01-16     1
 3: 02-01-16     1
 4: 02-01-16     1
 5: 03-01-16     1
 6: 03-01-16     1
 7: 06-01-16     2
 8: 06-01-16     2
 9: 06-01-16     2
10: 08-01-16     2
11: 08-01-16     2
12: 14-01-16     2
13: 14-01-16     2
14: 16-01-16     3
15: 17-01-16     3
16: 17-01-16     3
17: 17-01-16     3
18: 24-01-16     3
19: 24-01-16     3
20: 25-01-16     4
21: 25-01-16     4
22: 26-01-16     4
23: 29-01-16     4
24: 29-01-16     4
25: 29-01-16     4
26: 29-01-16     4

Explanation:

  • The rleid function creates a runlength id. This means that every time Date changes, the runlength id is increased by 1.
  • By substracting 1 from the runlength id and taking the modulus of it (the %% 3 part) you get a vector of sequences of 0,1&2's.
  • As last step you take the cumulative sum of the comparison of the values with the previous values. When index < shift(index, fill=1) is TRUE, the cumsum function will count that as a one.

In order to better see what this code does, see the output of the following code which creates a variable for each step:

setDT(d1)[, index1 := (rleid(Date)-1) %% 3
          ][, index2 := cumsum(index1 < shift(index1, fill=1))][]

Used data:

d1 <- structure(list(Date = structure(c(16801, 16802, 16802, 16802, 16803, 16803, 16806, 
                                        16806, 16806, 16808, 16808, 16814, 16814, 16816, 
                                        16817, 16817, 16817, 16824, 16824, 16825, 16825, 
                                        16826, 16829, 16829, 16829, 16829), class = "Date")), 
                .Names = "Date", row.names = c(NA, 26L), class = "data.frame")
Jaap
  • 81,064
  • 34
  • 182
  • 193
3

This constructs a grouped by 3 index for the unique values of Date and then uses character names to manage a lookup table for the conversion:

 fac <- ((seq(length(unique(dat$Date)))-1) %/%3) +1
 names(fac) <- unique(dat$Date)

 dat$myIndex <- fac[as.character(dat$Date)]
 dat
#-------
    Date Index myIndex
1  42370     1       1
2  42371     1       1
3  42371     1       1
4  42371     1       1
5  42372     1       1
6  42372     1       1
7  42375     2       2
8  42375     2       2
9  42375     2       2
10 42377     2       2
11 42377     2       2
12 42383     2       2
13 42383     2       2
14 42385     3       3
15 42386     3       3
16 42386     3       3
17 42386     3       3
18 42393     3       3
19 42393     3       3
20 42394     4       4
21 42394     4       4
22 42395     4       4
23 42398     4       4
24 42398     4       4
25 42398     4       4
26 42398     4       4
IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • I shift the integer sequence so it is zero-based rather than 1-based and then use modulo division `%/%` and then add 1 back to that result to make the grouping vector start at 1. I guess I could have added 2 rather than subtracting 1 and then I wouldn't have needed the second step. – IRTFM May 13 '16 at 20:44
  • what this step is doing : names(fac) <- unique(dat$Date) ? – Polar Bear May 13 '16 at 20:51
  • 1
    It's providing labels that can be matched against the values of `as.character(dat$Date)`. Using named vectors as lookup "tables" is a standard R paradigm. (Although they are not true R `tables`.) Probably could have done it with `match` as well. – IRTFM May 14 '16 at 00:31
  • How I should proceed if I were to go reverse i.e. I have the Index values and repeat it for a certain number of dates(say 3). The problem is described in detail here: http://stackoverflow.com/questions/37293841/r-matching-aggregate-values-with-daily-values-by-repetition – Polar Bear May 18 '16 at 10:07
3

base R. We can modify the rle (run-length encoding) of the object to group trios of values:

DF$index = with(rle(DF$Date), {
  g = ceiling(seq_along(values)/3)
  split(values, g) <- seq(tail(g,1))
  inverse.rle(list(lengths = lengths, values = values))
})

The weird split(x,g) <- bit was borrowed from ave. If the Date column is increasing, this can be done more simply (thanks to @Jaap):

DF$index = ceiling(match(DF$Date, unique(DF$Date))/3) # or...
DF$index = ceiling(as.integer(factor(DF$Date))/3)

data.table. The data.table analogue is simpler:

library(data.table)
setDT(DF)[, index := ceiling(rleid(Date)/3)]
Frank
  • 66,179
  • 8
  • 96
  • 180
2

I used data from an earlier version of the question:

 df <- data.frame(Date = c("01-01-16", "02-01-16", "02-01-16", "02-01-16", 
                        "03-01-16", "03-01-16", "06-01-16", "06-01-16", "06-01-16", "08-01-16", 
                        "08-01-16", "14-01-16", "14-01-16", "16-01-16", "17-01-16", "17-01-16", 
                        "17-01-16", "24-01-16", "24-01-16", "25-01-16", "25-01-16", "26-01-16", 
                        "29-01-16", "29-01-16", "29-01-16", "29-01-16"), 
                    Index = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
                        3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L))

I would start by converting the Date column from character to date, and making sure the data frame is sorted by date (you don't need that part with the new version of the data where Date is already numeric, and if you are sure the data frame is already sorted by Date):

df$Date <- as.Date(df$Date, format="%d-%m-%y")
df <- df[ order(df$Date),]

Then I would convert the date to consecutive integers - one way to do it is to convert to factor and then unclass (here I used c as a shorthand to do it) - and then cut it at equal intervals:

df$ndx <- c(factor(as.numeric(df$Date)))
df$ndx <- cut(df$ndx, seq(0.5, max(df$ndx)+0.5, by=3), labels=FALSE)
lebatsnok
  • 6,329
  • 2
  • 21
  • 22