I need to replace empty cells with zero (0) in R. I have a data frame like this:

dput(df)

structure(list(CHANNEL = structure(c(1L, 1L, 1L), .Label = "Native BlackBerry App", class = "factor"), 
    DATE = structure(c(1L, 1L, 1L), .Label = "01/01/2011", class = "factor"), 
    HOUR = structure(c(3L, 1L, 2L), .Label = c("1:00am-2:00am", 
    "2:00am-3:00am", "Midnight-1:00am"), class = "factor"), UNIQUE_USERS = structure(c(1L, 
    1L, 1L), .Label = "", class = "factor"), LOGON_VOLUME = structure(c(1L, 
    1L, 1L), .Label = "", class = "factor")), .Names = c("CHANNEL", 
"DATE", "HOUR", "UNIQUE_USERS", "LOGON_VOLUME"), row.names = c(NA, 
-3L), class = "data.frame")

I have this function:

sapply(df, function (x) 
     as.numeric(gsub("(^ +)|( +$)", "0", x))) 

It does not work. I get these warnings:

[ reached getOption("max.print") -- omitted 422793 rows ]
Warning messages:
1: In FUN(X[[4L]], ...) : NAs introduced by coercion
2: In FUN(X[[4L]], ...) : NAs introduced by coercion
3: In FUN(X[[4L]], ...) : NAs introduced by coercion
4: In FUN(X[[4L]], ...) : NAs introduced by coercion

Update: when I apply this function to df:

sapply(df, function (x) gsub("(^ +)|( +$)", "0", x) )

I get this:

  CHANNEL                 DATE         HOUR              UNIQUE_USERS LOGON_VOLUME
[1,] "Native BlackBerry App" "01/01/2011" "Midnight-1:00am" ""           ""          
[2,] "Native BlackBerry App" "01/01/2011" "1:00am-2:00am"   ""           ""          
[3,] "Native BlackBerry App" "01/01/2011" "2:00am-3:00am"   ""           ""  
Cœur
user1471980

1 Answer

You define an anonymous function in sapply then never use the argument to the function.

sapply(df, function (x) gsub("(^ +)|( +$)", "0", x) ) #===> change df to x

You also coerce everything to numeric, which produces NA for any string that is not a number. Each column of the data.frame is an atomic vector, so it can contain only one type of data. sapply simplifies its result to a matrix, which likewise holds a single type, so the common data type for all elements is character.
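As a rough illustration of both points, here is a minimal sketch on made-up toy values (the vector `x` and the data frame `toy` are assumptions for the example, not the data from the question):

```r
# Hypothetical toy vector, not taken from the question's data
x <- c("12", "", "abc")

# as.numeric() returns NA for anything that does not parse as a number
nums <- suppressWarnings(as.numeric(x))
is.na(nums)

# sapply() over columns of different types simplifies to one character matrix
toy <- data.frame(a = c("x", "y"), b = c(1, 2), stringsAsFactors = FALSE)
m <- sapply(toy, function(col) gsub("^\\s*$", "0", col))
is.character(m)
```

Only the first element of `x` survives the conversion, and the numeric column `b` comes back as text once it passes through gsub and sapply.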

Perhaps you meant to do this...

sapply( df , gsub , pattern = "^\\s*$" , replacement = 0 )

     CHANNEL                 DATE         HOUR              UNIQUE_USERS LOGON_VOLUME
[1,] "Native BlackBerry App" "01/01/2011" "Midnight-1:00am" "0"          "0"         
[2,] "Native BlackBerry App" "01/01/2011" "1:00am-2:00am"   "0"          "0"         
[3,] "Native BlackBerry App" "01/01/2011" "2:00am-3:00am"   "0"          "0"  

With gsub the result is still character, so you have to convert it to numeric afterwards, and the conversion yields NA for every value that is not a character representation of a number. If you need to change entire columns, you could check whether a whole column is empty and replace it with zero if it is. You can't have character elements and numeric elements in the same column.

# Count the blank (empty or whitespace-only) cells in each column
len <- colSums( sapply( df , grepl , pattern = "^\\s*$" ) )
# Replace only the columns in which every cell is blank
df[ , len == nrow(df) ] <- rep( 0 , nrow(df) )
#                CHANNEL       DATE            HOUR UNIQUE_USERS LOGON_VOLUME
#1 Native BlackBerry App 01/01/2011 Midnight-1:00am            0            0
#2 Native BlackBerry App 01/01/2011   1:00am-2:00am            0            0
#3 Native BlackBerry App 01/01/2011   2:00am-3:00am            0            0
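If some columns mix blanks with real numbers, a per-column approach may be more useful. The following is only a sketch under assumptions: `blank_to_zero` is a hypothetical helper (not part of the original answer), the `toy` data frame is invented for illustration, and NA is treated the same as a blank cell.

```r
# Hypothetical helper: fill blanks with 0 and convert a column to numeric,
# but only if every non-blank value parses as a number.
blank_to_zero <- function(col) {
  chr <- trimws(as.character(col))       # factor -> character, trim spaces
  blank <- is.na(chr) | chr == ""        # assumption: NA counts as blank
  nums <- suppressWarnings(as.numeric(chr))
  if (all(blank | !is.na(nums))) {
    nums[blank] <- 0
    nums
  } else {
    col                                  # leave text columns untouched
  }
}

# Toy data, assumed for illustration
toy <- data.frame(HOUR = c("1:00am-2:00am", "2:00am-3:00am"),
                  UNIQUE_USERS = c("", " 7"),
                  stringsAsFactors = TRUE)
toy[] <- lapply(toy, blank_to_zero)
str(toy)
```

Using lapply with `toy[] <-` keeps each column's own type, so text columns stay as they were while the numeric-looking ones become numeric.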
Simon O'Hanlon
  • I am trying to run dput(head(df, 2)) to get a subset of a huge df, but for some reason it is not working: it gives me the whole df in dput format. I followed your suggestions without any success. – user1471980 Sep 11 '13 at 21:03
  • @user1471980 It should give you the first two rows, assuming `df` really is a `data.frame`? `class( df )`? – Simon O'Hanlon Sep 11 '13 at 21:08
  • @user1471980 Ok. You can't have character and numeric elements in the same column. – Simon O'Hanlon Sep 11 '13 at 21:28