
I have read in a CSV and would like to find "empty" rows and columns, applying something like `isempty = function(x) all(is.na(x) | x == 0 | x == "")` to all columns. The first column is of mode character, all others are numeric.

However, when I do `emptycols = apply(mydf, 2, isempty)`, the logical vector that is returned is all `FALSE`.

When I try `emptycols = apply(mydf[ , -1], 2, isempty)`, it works perfectly, returning a logical vector which is `TRUE` for all "empty" columns.

I am aware that I could just use `sapply`, which works fine, but I still wonder: what causes this behaviour? How can the first (character) column affect the application of my function to all the other columns?

miura
  • Please can you provide sample data. – Richie Cotton Aug 17 '12 at 09:17
  • I cannot replicate this behaviour, so I would second Richie's suggestion of providing data. – seancarmody Aug 17 '12 at 09:20
  • If it's a data frame you might want to try `sapply(mydf, isempty)` instead, since `apply` is intended for matrices and arrays. I have a feeling the first column causes `apply` to turn your data frame into a character matrix, in which `"0"` would not match 0. However, when you `[,-1]` it gets turned into a numeric matrix and it works fine. – Backlin Aug 17 '12 at 09:20
  • Here is an example with a mix of characters and numbers, including characters in the first column, which does not exhibit the problem: `x <- data.frame(lab="fish", x=1:3, y=c(2,5,2), z=NA, a=c(NA,2,3), b=0, c="")` and `apply(x, 2, function(x) all(is.na(x) | x==0 | x==""))` gives the right result. – seancarmody Aug 17 '12 at 09:24
  • Unfortunately I can't provide the data that brought up this problem because my boss wouldn't like that, but I think Backlin has made a very good suggestion. While the code seancarmody provided does work, it seems that for every application of `isempty` to a column the data get coerced back to numeric if possible, as `apply(x, 2, class)` returns all `"character"` for these data, suggesting that the data frame gets coerced to a character matrix upon calling `apply`. – miura Aug 17 '12 at 09:31
  • This can also be illustrated by `x <- data.frame(lab="fish", x=1:3, y=c(2,5,2), z=NA, a=c(NA,2,3), b=0, c="")` and `apply(x, 2, identity)` – miura Aug 17 '12 at 09:36
  • You could include this information in your original question. – Roman Luštrik Aug 17 '12 at 09:37
  • There's always a way to "sanitize" your data. E.g., if you're collecting pricing data on grocery stores, just fill the `$storename` column with "AcmeFoods, P&A, Shop&Drop" etc. :-) . If Sacha's answer doesn't work w/ your data, try small subsets until you find the (potentially) offending data value. – Carl Witthoft Aug 17 '12 at 12:04
  • I can't reproduce your problem with the example dataset. Please see http://stackoverflow.com/q/5963269/567015 for tips on how to make a reproducible example. – Sacha Epskamp Aug 17 '12 at 12:08

1 Answer


@Backlin was right. If you change `isempty` so that it also reports the type of its input, like this:

isempty = function(x) c(typeof(x), all(x == 0 | is.na(x) | x == ""))

The following results show what happens:

> apply(mydata, 2, isempty)
     one         two         three      
[1,] "character" "character" "character"
[2,] "FALSE"     "FALSE"     "FALSE" 

> apply(mydata[,-1], 2, isempty)
     two       three    
[1,] "integer" "integer"
[2,] "TRUE"    "TRUE"   

Quoting @Backlin: "the first column causes `apply` to turn your data frame into a character matrix, in which `"0"` would not match 0. However, when you `[,-1]` it gets turned into a numeric matrix and it works fine."
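
The coercion can be inspected directly. The following is a minimal sketch on a made-up data frame (`mydf` and its columns are hypothetical, not the asker's data). As far as I know, `apply` converts a data frame with `as.matrix`, and when a character column is present the numeric columns are turned into strings via `format()`, which may pad values to a common width, so a stored zero can end up as something like `" 0"` rather than `"0"`.

```r
# Hypothetical data: one character column, two numeric columns.
mydf <- data.frame(
  id = c("a", "b", "c"),
  x  = c(0, NA, 0),   # "empty" by the question's definition
  y  = c(1, 2, 3),
  stringsAsFactors = FALSE
)

isempty <- function(x) all(is.na(x) | x == 0 | x == "")

# apply() works on a matrix, so the data frame is converted first.
m <- as.matrix(mydf)
typeof(m)     # the whole matrix is character, numeric columns included
m[, "x"]      # inspect how the zeros were converted to strings

# Without the character column the matrix stays numeric.
typeof(as.matrix(mydf[, -1]))

res_all     <- apply(mydf, 2, isempty)        # comparisons run on strings
res_numeric <- apply(mydf[, -1], 2, isempty)  # comparisons run on numbers
```

Printing `res_all` and `res_numeric` side by side shows whether the string conversion changed the verdict for a given column.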

`sapply` behaves better, because it passes each column to the function with its original type:

> sapply(mydata, isempty)
     one         two       three    
[1,] "character" "integer" "integer"
[2,] "FALSE"     "TRUE"    "TRUE"   
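
For the original goal of flagging both empty columns and empty rows, one possible approach is to build an element-wise logical mask column by column, so that no character coercion takes place. This is only a sketch: `mydf`, `blank`, `mask`, `empty_cols` and `empty_rows` are hypothetical names, and leaving the first (label) column out of the row check is my assumption about what "empty row" should mean.

```r
# Hypothetical data: a label column followed by numeric columns.
mydf <- data.frame(
  id = c("a", "b", "c"),
  x  = c(0, NA, 0),
  y  = c(1, 0, 3),
  stringsAsFactors = FALSE
)

# Element-wise test; is.na() is checked first, so NA entries yield TRUE.
blank <- function(x) is.na(x) | x == 0 | x == ""

# One logical column per data-frame column, each tested in its own type.
mask <- vapply(mydf, blank, logical(nrow(mydf)))

# A column is "empty" if it has no non-blank entries.
empty_cols <- colSums(!mask) == 0

# A row is "empty" if all columns except the label column are blank.
empty_rows <- rowSums(!mask[, -1, drop = FALSE]) == 0
```

Note that `vapply` returns a plain vector rather than a matrix when the data frame has a single row, so that case would need separate handling.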
ROLO