
I am trying to take the following dataset and convert it to the second one. Basically, I'm trying to fill in the NAs between each pair of matching IDs with that ID.

Each ID corresponds to two time stamps, which I've joined into a larger date_time column. For context, it's too computationally expensive to do a SQL between join (the date_time column is very large), or even to take the original dataset, create the time stamps between each ID, and then join them in (I have too many IDs to do this). I've successfully completed both of those methods, and they just take too much time for the amount of data I have. I'm hoping to manipulate the data starting from this data set. It seems like such a simple thing, but it's really got me stumped. Any help would be appreciated.

Current data set:

             date_time     id
                <dttm>  <dbl>
 1 2017-01-30 08:00:00     NA
 2 2017-01-30 08:00:01     NA
 3 2017-01-30 08:00:02     1
 4 2017-01-30 08:00:03     NA
 5 2017-01-30 08:00:04     NA
 6 2017-01-30 08:00:05     NA
 7 2017-01-30 08:00:06     NA
 8 2017-01-30 08:00:07     1
 9 2017-01-30 08:00:08     NA
10 2017-01-30 08:00:09     NA
11 2017-01-30 08:00:10     2
12 2017-01-30 08:00:11     NA
13 2017-01-30 08:00:12     NA
14 2017-01-30 08:00:13     NA
15 2017-01-30 08:00:14     2
16 2017-01-30 08:00:15     NA
17 2017-01-30 08:00:16     3
18 2017-01-30 08:00:17     NA
19 2017-01-30 08:00:18     3
20 2017-01-30 08:00:19     NA

Desired dataset:

             date_time     id
                <dttm>  <dbl>
 1 2017-01-30 08:00:00     NA
 2 2017-01-30 08:00:01     NA
 3 2017-01-30 08:00:02     1
 4 2017-01-30 08:00:03     1
 5 2017-01-30 08:00:04     1
 6 2017-01-30 08:00:05     1
 7 2017-01-30 08:00:06     1
 8 2017-01-30 08:00:07     1
 9 2017-01-30 08:00:08     NA
10 2017-01-30 08:00:09     NA
11 2017-01-30 08:00:10     2
12 2017-01-30 08:00:11     2
13 2017-01-30 08:00:12     2
14 2017-01-30 08:00:13     2
15 2017-01-30 08:00:14     2
16 2017-01-30 08:00:15     NA
17 2017-01-30 08:00:16     3
18 2017-01-30 08:00:17     3
19 2017-01-30 08:00:18     3
20 2017-01-30 08:00:19     NA

dput() data:

structure(list(date_time = structure(c(1485781200, 1485781201, 
1485781202, 1485781203, 1485781204, 1485781205, 1485781206, 1485781207, 
1485781208, 1485781209, 1485781210, 1485781211, 1485781212, 1485781213, 
1485781214, 1485781215, 1485781216, 1485781217, 1485781218, 1485781219
), class = c("POSIXct", "POSIXt"), tzone = ""), trx_id = c(NA_real_, 
NA_real_, 1, NA_real_, NA_real_, NA_real_, NA_real_, 1, 
NA_real_, NA_real_, 2, NA_real_, NA_real_, NA_real_, 2, 
NA_real_, 3, NA_real_, 3, NA_real_)), .Names = c("date_time", 
"trx_id"), row.names = c(NA, -20L), class = c("tbl_df", "tbl", 
"data.frame"))
jstauss
  • could you use `dput` so that we can copy the example dataframe? – Pdubbs Feb 17 '18 at 15:47
  • Use function `na.locf` from package `zoo`. – Rui Barradas Feb 17 '18 at 15:53
  • Possible duplicate of [Replacing NAs with latest non-NA value](https://stackoverflow.com/questions/7735647/replacing-nas-with-latest-non-na-value) – Rui Barradas Feb 17 '18 at 15:55
  • @Pdubbs I added the output from dput(). I'm new to including reproducible examples, so let me know if what I did wasn't right. – jstauss Feb 17 '18 at 16:01
  • @RuiBarradas I tried na.locf, but it fills the NA's after the ID's (so if there was a `c(3, NA, NA, 3, NA, NA, 4)` it would convert it to `c(3, 3, 3, 3, 3, 3, 4)`, which I don't want). I want the data to be `c(3, 3, 3, 3, NA, NA, 4)`. – jstauss Feb 17 '18 at 16:05
  • OK, I will retract my close as dupe vote. I had misunderstood the problem. – Rui Barradas Feb 17 '18 at 16:25

3 Answers


One solution is to use the fill function from tidyr. The approach is simple: first create two columns, prev_val and next_val, as copies of the value column. Then use fill to populate the missing values, filling prev_val downward and next_val upward.

Now, for rows that have the same value in both prev_val and next_val, value should be set to prev_val (those missing values lie between two occurrences of the same number).

library(dplyr)
library(tidyr)

df <- read.table(text = "sl date_time value
1 '2017-01-30 08:00:00'     NA
2 '2017-01-30 08:00:01'     NA
3 '2017-01-30 08:00:02'     1
4 '2017-01-30 08:00:03'     NA
5 '2017-01-30 08:00:04'     NA
6 '2017-01-30 08:00:05'     NA
7 '2017-01-30 08:00:06'     NA
8 '2017-01-30 08:00:07'     1
9 '2017-01-30 08:00:08'     NA
10 '2017-01-30 08:00:09'     NA
11 '2017-01-30 08:00:10'     2
12 '2017-01-30 08:00:11'     NA
13 '2017-01-30 08:00:12'     NA
14 '2017-01-30 08:00:13'     NA
15 '2017-01-30 08:00:14'     2
16 '2017-01-30 08:00:15'     NA
17 '2017-01-30 08:00:16'     3
18 '2017-01-30 08:00:17'     NA
19 '2017-01-30 08:00:18'     3
20 '2017-01-30 08:00:19'     NA", header = TRUE, stringsAsFactors = FALSE)

# Fill prev_val downward and next_val upward, then keep the filled value
# only where both agree (the NA lies between two rows with the same id)
df %>%
  mutate(prev_val = value, next_val = value) %>%
  fill(prev_val, .direction = "down") %>%
  fill(next_val, .direction = "up") %>%
  mutate(value = ifelse(prev_val == next_val, prev_val, value)) %>%
  select(-prev_val, -next_val)

Result:
   sl           date_time value
1   1 2017-01-30 08:00:00    NA
2   2 2017-01-30 08:00:01    NA
3   3 2017-01-30 08:00:02     1
4   4 2017-01-30 08:00:03     1
5   5 2017-01-30 08:00:04     1
6   6 2017-01-30 08:00:05     1
7   7 2017-01-30 08:00:06     1
8   8 2017-01-30 08:00:07     1
9   9 2017-01-30 08:00:08    NA
10 10 2017-01-30 08:00:09    NA
11 11 2017-01-30 08:00:10     2
12 12 2017-01-30 08:00:11     2
13 13 2017-01-30 08:00:12     2
14 14 2017-01-30 08:00:13     2
15 15 2017-01-30 08:00:14     2
16 16 2017-01-30 08:00:15    NA
17 17 2017-01-30 08:00:16     3
18 18 2017-01-30 08:00:17     3
19 19 2017-01-30 08:00:18     3
20 20 2017-01-30 08:00:19    NA
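
The same prev/next comparison can also be sketched on a plain vector without tidyr. This is only an illustration: `locf` and `fill_between` are hypothetical helper names, and, like the pipeline above, it assumes a gap should be filled only when the nearest non-NA values on both sides are equal.

```r
# Carry the last non-NA value forward (leading NAs stay NA).
locf <- function(x) {
  idx <- cummax(seq_along(x) * !is.na(x))  # position of the last non-NA seen so far, 0 if none
  c(NA, x)[idx + 1]                        # position 0 maps to the prepended NA
}

# Fill an NA only when the previous and next non-NA values agree.
fill_between <- function(x) {
  prev_val <- locf(x)
  next_val <- rev(locf(rev(x)))            # backward fill via a reversed forward fill
  same <- !is.na(prev_val) & !is.na(next_val) & prev_val == next_val
  ifelse(same, prev_val, x)
}

id <- c(NA, NA, 1, NA, NA, NA, NA, 1, NA, NA, 2, NA, NA, NA, 2, NA, 3, NA, 3, NA)
filled <- fill_between(id)
print(filled)
```

One caveat that applies to any prev/next comparison: if the same id were used for more than one interval, the gap between those intervals would be filled as well.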
MKR

Here is a base R option. We split the row indices of the dataset by 'trx_id' (the column in the input data the OP posted), expand each pair of indices into the full sequence between them (seq), stack the result into a two-column dataset 'd1' ('values' holds the row indices, 'ind' holds the id), and then assign 'ind' to 'trx_id' at the rows given by 'values'.

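# Row indices per id, expanded to the full range between first and second occurrence
d1 <- stack(lapply(split(seq_len(nrow(df1)), df1$trx_id), function(x) seq(x[1], x[2])))
# 'ind' is a factor, so convert it back to the numeric id before assigning
df1$trx_id[d1$values] <- as.numeric(as.character(d1$ind))
df1$trx_id
#[1] NA NA  1  1  1  1  1  1 NA NA  2  2  2  2  2 NA  3  3  3 NA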
akrun
  • I'm getting an error when I try this on my full data set. It says `Error in seq.default(x[1], x[2]) : 'to' must be a finite number`. I'm working to figure it out, but it does work on the example code I've given. – jstauss Feb 19 '18 at 13:23
  • @jstauss That's interesting. Can you check the output of `split(seq_len(nrow(df1)), df1$trx_id)` – akrun Feb 19 '18 at 13:43
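
One possible cause of that error (an assumption, since the full data set is not shown) is an id that occurs only once: `x[2]` is then `NA`, and `seq()` stops with this message. A minimal sketch that inspects the `split()` output and expands only ids with exactly two rows follows; the toy `df1` and the names `idx` and `ok` are made up for illustration.

```r
# Toy data: id 2 appears only once, so seq(x[1], x[2]) would fail for it.
df1 <- data.frame(trx_id = c(NA, 1, NA, 1, NA, 2, NA, 3, NA, NA, 3))

idx <- split(seq_len(nrow(df1)), df1$trx_id)
ok  <- lengths(idx) == 2           # ids with exactly a start row and an end row
print(names(idx)[!ok])             # ids that need a closer look

# Expand only the well-formed pairs and write the id into the rows between them.
d1 <- stack(lapply(idx[ok], function(x) seq(x[1], x[2])))
df1$trx_id[d1$values] <- as.numeric(as.character(d1$ind))
print(df1$trx_id)
```

Ids flagged by the check are left untouched here; whether they should be dropped or treated differently depends on the data.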

A non tidyr approach, where x is your id column:

x <- c(NA,NA, 1,NA,NA,1, NA, NA, 2, NA, NA,2, NA, 3,NA, NA,3,NA)

# Positions of the non-NA ids; each consecutive pair marks one start/end interval
timestamps <- which(!is.na(x))
timestamps <- matrix(timestamps, ncol = 2, byrow = TRUE)

# For each interval, return a copy of x with the NAs inside that interval filled
xmatrix <- apply(timestamps, MARGIN = 1, FUN = function(i) {
  y <- x[i[1]:i[2]]
  y[is.na(y)] <- x[i][1]
  x[i[1]:i[2]] <- y
  return(x)
})

# Collapse the per-interval copies: keep the id wherever any copy has one
(x <- apply(xmatrix, 1, FUN = function(z) {
  ifelse(all(is.na(z)), NA, max(z, na.rm = TRUE))
}))

##  [1] NA NA  1  1  1  1 NA NA  2  2  2  2 NA  3  3  3  3 NA

HTH
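
Since run time was the concern in the question, a fully vectorised variant may be worth trying. This is a sketch under two assumptions that have not been checked against the real data: every id occurs exactly twice, and intervals never overlap, so the non-NA entries alternate start, end, start, end. `fill_pairs` is a hypothetical name.

```r
fill_pairs <- function(x) {
  seen    <- !is.na(x)
  inside  <- cumsum(seen) %% 2 == 1        # TRUE from each start up to just before its end
  idx     <- cummax(seq_along(x) * seen)   # position of the last non-NA value so far
  carried <- c(NA, x)[idx + 1]             # last observation carried forward
  ifelse(inside | seen, carried, NA)       # keep carried values only within an interval
}

x <- c(NA, NA, 1, NA, NA, 1, NA, NA, 2, NA, NA, 2, NA, 3, NA, NA, 3, NA)
print(fill_pairs(x))
```

It relies only on `cumsum`, `cummax` and indexing, so no per-id loop or join is involved. If an id appears an odd number of times, the start/end alternation breaks and the result will be wrong, so a count check beforehand would be prudent.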