I have a huge data.frame with around 200 variables, each represented by a column. Unfortunately, the data is sourced from a poorly formatted data dump (and hence can't be modified) which represents both missing values and zeroes as 0. The data has been observed every 5 minutes for a month, and a day-long period of only 0s can be reasonably thought of as a day where the counter was not functioning, thereby leading to the conclusion that those 0s are actually NAs.

I want to find (and remove) columns that have at least 288 consecutive 0s at any point. Or, more generally, how can we remove columns from a data.frame containing >=k consecutive 0s?

I'm relatively new to R, and any help would be greatly appreciated. Thanks!

EDIT: Here is a reproducible example. Considering k=4, I would like to remove columns A and B (but not C, since the 0s are not consecutive).

df<-data.frame(A=c(4,5,8,2,0,0,0,0,6,3), B=c(3,0,0,0,0,6,8,2,1,0), C=c(4,5,6,0,3,0,2,1,0,0), D=c(1:10))
df
   A B C  D
1  4 3 4  1
2  5 0 5  2
3  8 0 6  3
4  2 0 0  4
5  0 0 3  5
6  0 6 0  6
7  0 8 2  7
8  0 2 1  8
9  6 1 0  9
10 3 0 0 10
curious
  • Would you mind posting a short sample of your data? Have a look [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example#5963610). It will be easier for you to find help. – DJJ Jun 09 '17 at 14:50
  • If you read @DJJ's link and provide a sample dataset, I can show you how to use the function on your data using `sapply`. – M-- Jun 09 '17 at 15:02
  • @DJJ added example data – curious Jun 09 '17 at 15:11
  • @Masoud accepted your answer. thanks – curious Jun 09 '17 at 15:16

1 Answer

You can use this function on your data:

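cons.Zeros <- function (x, n)
{
    # Ignore NAs, then flag which of the remaining values are zero
    x <- x[!is.na(x)] == 0
    # Run-length encode the flags: lengths and values of consecutive runs
    r <- rle(x)
    # TRUE if any run of zeros is at least n long
    any(r$lengths[r$values] >= n)
}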

This function returns TRUE for a column that should be dropped, that is, one containing a run of at least n consecutive zeros. n is the minimum run length that triggers the drop. Note that NAs are removed before the runs are counted.
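
To see why this works, it can help to look at what `rle()` returns for a single column. The following is only a minimal sketch, using column A from the question's example; the variable name `zero_runs` is introduced here for illustration.

```r
# Column A from the question's example data
A <- c(4, 5, 8, 2, 0, 0, 0, 0, 6, 3)

# rle() run-length encodes the logical vector "is this value zero?"
r <- rle(A == 0)
r$values   # FALSE  TRUE FALSE
r$lengths  # 4 4 2

# Keep only the lengths of the zero runs, then apply the same test as cons.Zeros
zero_runs <- r$lengths[r$values]
any(zero_runs >= 4)  # TRUE, so A would be dropped for n = 4
```

Indexing `r$lengths` by `r$values` keeps the lengths of the TRUE runs only, so runs of non-zero values never count towards the threshold.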

For your sample dataset let's use n = 3 (n = 4, as in your question, gives the same result here):

df.dropped <- df[, !sapply(df, cons.Zeros, n = 3), drop = FALSE]

#output:
# > df.dropped 

#    C  D 
# 1  4  1 
# 2  5  2 
# 3  6  3 
# 4  0  4 
# 5  3  5 
# 6  0  6 
# 7  2  7 
# 8  1  8 
# 9  0  9 
# 10 0 10
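
For the general "at least k consecutive zeros" case, the same logic could be wrapped in a small helper. This is a sketch under assumptions rather than part of the original answer: the name `drop_zero_run_cols` and its arguments are hypothetical, and it uses `vapply` instead of `sapply` so that the result is always a logical vector.

```r
# Hypothetical helper: drop every column of `data` that contains a run of
# at least `k` consecutive zeros (NAs are ignored, as in cons.Zeros).
drop_zero_run_cols <- function(data, k) {
  has_long_run <- vapply(data, function(x) {
    r <- rle(x[!is.na(x)] == 0)
    any(r$lengths[r$values] >= k)
  }, logical(1))
  # drop = FALSE keeps a data.frame even if only one column survives
  data[, !has_long_run, drop = FALSE]
}

# Example data from the question
df <- data.frame(A = c(4, 5, 8, 2, 0, 0, 0, 0, 6, 3),
                 B = c(3, 0, 0, 0, 0, 6, 8, 2, 1, 0),
                 C = c(4, 5, 6, 0, 3, 0, 2, 1, 0, 0),
                 D = 1:10)

names(drop_zero_run_cols(df, k = 4))  # "C" "D"
names(drop_zero_run_cols(df, k = 2))  # "D"
```

For data observed every 5 minutes, one day corresponds to 288 observations, so the call on the real data would presumably use `k = 288`.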
M--