How do I deduplicate but control for a condition in R?

Question

I have the a dataframe with two columns, yearmon and x. There are multiple entries for yearmon. Some have corresponding info in the x column and others have NA.

If, for example, one of the January 2001 entries has data in the x column I want to keep that row but remove other rows for Jan 2001 that have NA in them. However, if none of the January 2001 entries have data in the x column, I want to keep the NA column. I tried deduplicate, unique, merge, aggregate and have had no luck.

Here is a sample of my data as requested:

      Yearmon     Price
1   2001-01-01        NA
2   2001-01-01  188000.0
3   2001-02-01        NA
4   2001-02-01  250000.0
5   2001-03-01        NA
6   2001-03-01  250000.0
7   2001-04-01        NA
8   2001-05-01        NA
9   2001-05-01  325000.0
10  2001-06-01        NA
11  2001-06-01  375000.0
12  2001-07-01        NA
13  2001-08-01        NA
14  2001-09-01        NA
15  2001-10-01        NA
16  2001-11-01        NA
17  2001-11-01  305000.0
18  2001-12-01        NA
19  2001-12-01  360000.0

If you're unsure how to do that: read [this](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — Heroka, Nov 19 '15 at 21:13

score 0 · Answer 1 · answered Nov 19 '15 at 23:42

Because you have your data sorted with NAs at the beginning of each group, you should be able to use run-length-encoding to calculate the indices of the rows you wish to keep:

> df[cumsum(rle(as.integer(df$Yearmon))$lengths),]
      Yearmon  Price
2  2001-01-01 188000
4  2001-02-01 250000
6  2001-03-01 250000
7  2001-04-01     NA
9  2001-05-01 325000
11 2001-06-01 375000
12 2001-07-01     NA
13 2001-08-01     NA
14 2001-09-01     NA
15 2001-10-01     NA
17 2001-11-01 305000
19 2001-12-01 360000

A much slower, but a bit more legible version using by :

do.call(rbind, by(df, df$Yearmon, tail, 1))

How do I deduplicate but control for a condition in R?

1 Answers1