0

Let's suppose I have this dataframe:

Date        A   B
2010-01-01  NA  1
2010-01-02  2   NA
2010-01-05  3   NA
2010-01-07  NA  4
2010-01-20  5   NA
2010-01-25  6   7

I want to collapse rows, removing the NA values to the closest Date. So the result would be:

Date        A  B
2010-01-02  2  1
2010-01-07  3  4
2010-01-20  5  NA
2010-01-25  6  7

I saw this stack overflow that solves collapsing using a key value, but I could not find a similar case using close date values to collapse.

Obs1: It would be good if there was a way to not collapse the rows if the dates are too far apart (example: more than 15 days apart).

Obs2: It would be good if the collapsing lines kept the latter date rather than the earlier, as shown in the example above.

Daniel
  • 639
  • 8
  • 24
  • Do you expect to collapse more than 2 rows? – MKR May 11 '18 at 19:59
  • Only collapse where there is NA (so in the example above the answer would be no). If there were more columns it could be the case to collapse more than 2 rows. – Daniel May 11 '18 at 20:21
  • Yea. I was thinking on similar line and in solution I tried to collapse until both `A` and `B` got `non-NA` values. – MKR May 11 '18 at 20:23

1 Answers1

0

Using dplyr package an option could be to group_by on combination of A and B in such a way that they form complete values.

Considering Obs#2 the max of Date should be taken for combined row.

library(dplyr)
library(lubridate)
df %>% mutate(Date = ymd(Date)) %>%
  mutate(GrpA = cumsum(!is.na(A)), GrpB = cumsum(!is.na(B))) %>%
  rowwise() %>%
  mutate(Grp = max(GrpA, GrpB)) %>%
  ungroup() %>%
  select(-GrpA, -GrpB) %>%
  group_by(Grp) %>%
  summarise(Date = max(Date), A = A[!is.na(A)][1], B = B[!is.na(B)][1])

# # A tibble: 4 x 4
# Grp Date           A     B
# <int> <date>     <int> <int>
# 1     1 2010-01-02     2     1
# 2     2 2010-01-07     3     4
# 3     3 2010-01-20     5    NA
# 4     4 2010-01-25     6     7

Data:

df <- read.table(text = 
"Date        A   B
2010-01-01  NA  1
2010-01-02  2   NA
2010-01-05  3   NA
2010-01-07  NA  4
2010-01-20  5   NA
2010-01-25  6   7",
stringsAsFactors = FALSE, header = TRUE)
MKR
  • 19,739
  • 4
  • 23
  • 33
  • This does not work. For example, using the following df, the line where B = 5 disappears: df <- read.table(text = "Date A B 2010-01-01 NA 1 2010-01-02 2 NA 2010-01-05 3 NA 2010-01-06 8 NA 2010-01-07 NA 4 2010-01-08 NA 5 2010-01-20 5 NA", stringsAsFactors = FALSE, header = TRUE) – Daniel May 11 '18 at 20:35
  • @Daniel That's true. Its not considering that case. Let me re-visit it. – MKR May 11 '18 at 20:43