3

I am trying to clean my data. One of the criteria is that I need an uninterrupted sequence of a variable "assets", but I have some NAs. However, I cannot simply delete the NA observations, but need to delete all subsequent observations following the NA event.

Here an example:

productreference<-c(1,1,1,1,2,2,2,3,3,3,3,4,4,4,5,5,5,5)
Year<-c(2000,2001,2002,2003,1999,2000,2001,2005,2006,2007,2008,1998,1999,2000,2000,2001,2002,2003)
assets<-c(2,3,NA,2,34,NA,45,1,23,34,56,56,67,23,23,NA,14,NA)
mydf<-data.frame(productreference,Year,assets)
mydf

#    productreference Year assets
# 1                 1 2000      2
# 2                 1 2001      3
# 3                 1 2002     NA
# 4                 1 2003      2
# 5                 2 1999     34
# 6                 2 2000     NA
# 7                 2 2001     45
# 8                 3 2005      1
# 9                 3 2006     23
# 10                3 2007     34
# 11                3 2008     56
# 12                4 1998     56
# 13                4 1999     67
# 14                4 2000     23
# 15                5 2000     23
# 16                5 2001     NA
# 17                5 2002     14
# 18                5 2003     NA

I have already seen that there is a way to carry out functions by group using plyr and I have also been able to create a column with 0-1, where 0 indicates that assets has a valid entry and 1 highlights missing values of NA.

mydf$missing<-ifelse(mydf$assets>=0,0,1)
mydf[c("missing")][is.na(mydf[c("missing")])] <- 1

I have a very large data set so cannot manually delete the rows and would greatly appreciate your help!

rawr
  • 20,481
  • 4
  • 44
  • 78
AylaHad
  • 33
  • 5
  • Do you want this done using `productreference` as a grouping variable? – Gregor Thomas Jun 15 '16 at 21:51
  • halfway guessing here `mydf[!is.na(ave(assets, productreference, FUN = cumsum)), ]` – rawr Jun 15 '16 at 22:02
  • Since we have `dplyr` and `data.table` answers, I'll annotate rawr's comment to say that's how you'd do it in base. – Gregor Thomas Jun 15 '16 at 22:16
  • Yes, I did mean by productreference, than you. I just tried the "base-way" of doing this but I get an error called "Error in interaction(...) : object 'productreference' not found". Any ideas what might be going wrong? – AylaHad Jun 16 '16 at 20:41

4 Answers4

5

I believe this is what you want:

library(dplyr)
group_by(mydf, productreference) %>%
    filter(cumsum(is.na(assets)) == 0)
# Source: local data frame [11 x 3]
# Groups: productreference [5]
# 
#    productreference  Year assets
#               (dbl) (dbl)  (dbl)
# 1                 1  2000      2
# 2                 1  2001      3
# 3                 2  1999     34
# 4                 3  2005      1
# 5                 3  2006     23
# 6                 3  2007     34
# 7                 3  2008     56
# 8                 4  1998     56
# 9                 4  1999     67
# 10                4  2000     23
# 11                5  2000     23
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • Thank you, that is looking promising! Unfortunately, I also get an error here: "Error: could not find function "%>%"." Do I need to install any further packages? – AylaHad Jun 16 '16 at 20:42
  • What version of `dplyr` do you have? `%.%` was deprecated and replaced with `%>%` in version 0.2 (May 2014)... the current version is something like 0.4.3... – Gregor Thomas Jun 16 '16 at 20:51
  • I have version 1.8.4; I just ran the command "update.packages()", but a new version of plyr was not offered. Am I getting something wrong or do I have the most current version? – AylaHad Jun 16 '16 at 21:47
  • 1
    You say `plyr`, I say `dplyr`.The `d` matters. `dplyr` is more-or-less a replacement for `plyr`. It's focused only on data frames (unlike `plyr` which works for lists and arrays too) but is *much* more efficient (and more user-friendly too). – Gregor Thomas Jun 16 '16 at 22:08
3

Here is the same approach using data.table:

library(data.table)
dt <- as.data.table(mydf)

dt[,nas:= cumsum(is.na(assets)),by="productreference"][nas==0]

#    productreference Year assets nas
# 1:                1 2000      2   0
# 2:                1 2001      3   0
# 3:                2 1999     34   0
# 4:                3 2005      1   0
# 5:                3 2006     23   0
# 6:                3 2007     34   0
# 7:                3 2008     56   0
# 8:                4 1998     56   0
# 9:                4 1999     67   0
#10:                4 2000     23   0
#11:                5 2000     23   0
Mike H.
  • 13,960
  • 2
  • 29
  • 39
  • It will hep to set keys for a big data set. – Bulat Jun 15 '16 at 22:29
  • That would be great, as I will need to do further analysis by "productreference"! Unfortunately, I get the error: "Error in `[.data.frame`(small, , `:=`(nas, cumsum(is.na(estimatedassets))), : unused argument (by = "productreference")". Could this be because I have the wrong type for productreference? I imported it as a numeric... – AylaHad Jun 16 '16 at 20:46
  • You need to do this: `dt <- as.data.table(mydf)`. You also need to load/install the `data.table` package. You're getting this error because you're trying to use `data.table` operations on a `data.frame` so you need to convert it using the above code. – Mike H. Jun 16 '16 at 20:48
  • I have converted to a data.table as you suggested and run the third line again. Am still getting an error: "Error in `:=`(nas, cumsum(is.na(estimatedassets))) : Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are defined for use in j, once only and in particular ways." I just read the help for ":=", but am still none the wiser. – AylaHad Jun 16 '16 at 21:54
  • A quick google of your error gives this: http://stackoverflow.com/questions/27980835/r-data-table-works-in-direct-call-but-same-function-in-a-package-fails. Are you trying to use this call in a package? – Mike H. Jun 16 '16 at 22:06
2

Here is a base R option

mydf[unsplit(lapply(split(mydf, mydf$productreference),
     function(x) cumsum(is.na(x$assets))==0), mydf$productreference),]    
#   productreference Year assets
#1                 1 2000      2
#2                 1 2001      3
#5                 2 1999     34
#8                 3 2005      1
#9                 3 2006     23
#10                3 2007     34
#11                3 2008     56
#12                4 1998     56
#13                4 1999     67
#14                4 2000     23
#15                5 2000     23

Or an option with data.table

library(data.table)
setDT(mydf)[, if(any(is.na(assets))) .SD[seq(which(is.na(assets))[1]-1)] 
                    else .SD, by = productreference]
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thank you very much, both these approaches worked perfectly. Are there any pros or cons of using the base R approach over plyr or data.table? With over 230,000 observations I am worried about running time... – AylaHad Jun 16 '16 at 20:59
  • 230,000 isn't *so* big. `dplyr` will probably be faster than `base`, and `data.table` will be the fastest. The difference will probably be a few seconds--maybe less between the `dplyr` and `data.table` versions--which if you're running it once won't much matter. But the real answer is to try it and see! – Gregor Thomas Jun 16 '16 at 22:44
1

You can do it using base R and a for loop. This code is a bit longer than some of the code in the other answers. In the loop we subset mydf by productreference and for every subset we look for the first occurrence of assets==NA, and exclude that row and all following rows.

mydf2 <- NULL
for (i in 1:max(mydf$productreference)){
  s1 <- mydf[mydf$productreference==i,]
  s2 <- s1[1:ifelse(all(!is.na(s1$assets)), NROW(s1), min(which(is.na(s1$assets)==T))-1),]
  mydf2 <- rbind(mydf2, s2)
  mydf2 <- mydf2[!is.na(mydf2$assets),]
}
mydf2
milan
  • 4,782
  • 2
  • 21
  • 39
  • Thank you, this also worked nicely. It did create about 7 additional empty observations in the first few rows, but simply shifted the data frame. So no problem to simply delete the rows afterwards. – AylaHad Jun 16 '16 at 21:11
  • 1
    You're right, and that's because I forgot to exclude rows with assets==NA when they were first in a subset based on productreference. I changed it in the code above. Thanks. – milan Jun 17 '16 at 01:23