-3

I got a big data set that contains monthly returns of a given stock. I'd like to delete rows that do not have a full year data. A subset of data is shown below as an example:

Date        Return  Year        
9/1/2009    0.71447 2009
10/1/2009   0.48417 2009
11/1/2009   0.90753 2009
12/1/2009   -0.7342 2009
1/1/2010    0.83293 2010
2/1/2010    0.18279 2010
3/1/2010    0.19416 2010
4/1/2010    0.38907 2010
5/1/2010    0.37834 2010
6/1/2010    0.6401  2010
7/1/2010    0.62079 2010
8/1/2010    0.42128 2010
9/1/2010    0.43117 2010
10/1/2010   0.42307 2010
11/1/2010   -0.1994 2010
12/1/2010   -0.2252 2010

Ideally, the code will remove the first four observations since they don't have a full year of observation.

C8H10N4O2
  • 18,312
  • 8
  • 98
  • 134
Roger
  • 355
  • 1
  • 2
  • 15

1 Answers1

1

The OP has requested to remove all rows from a large data set of monthly values which do not make up a full year. Although the solution suggested by Wen seems to be working for the OP I would like to suggest a more robust approach.

Wen's solution counts the number of rows per year assuming that there is exactly one row per month. It would be more robust to count the number of unique months per year in case there are duplicate entries in the production data set. (From my experience, one cannot be careful enough when dealing with production data and check all assumptions).

library(data.table)
# count number of unique months per year,
# keep only complete years, omit counts
# result is a data.table with one column Year
full_years <- DT[, uniqueN(month(Date)), by = Year][V1 == 12L, -"V1"]
full_years
   Year
1: 2010
# right join with original table, only rows belonging to a full year will be returned
DT[full_years, on = "Year"]
          Date   Return Year
 1: 2010-01-01  0.83293 2010
 2: 2010-02-01  0.18279 2010
 3: 2010-03-01  0.19416 2010
 4: 2010-04-01  0.38907 2010
 5: 2010-05-01  0.37834 2010
 6: 2010-06-01  0.64010 2010
 7: 2010-07-01  0.62079 2010
 8: 2010-08-01  0.42128 2010
 9: 2010-09-01  0.43117 2010
10: 2010-10-01  0.42307 2010
11: 2010-11-01 -0.19940 2010
12: 2010-12-01 -0.22520 2010

Note that this approach avoids to add a count column to each row of a potentially large data set.

The code can be written more concisely as:

DT[DT[, uniqueN(month(Date)), by = Year][V1 == 12L, -"V1"], on = "Year"]

It is also possible to check the data for any duplicate months, e.g.,

stopifnot(all(DT[, .N, by = .(Year, month(Date))]$N == 1L))

This code counts the number of occurrences for each year and month and halts execution when there is more than one.

Uwe
  • 41,420
  • 11
  • 90
  • 134