Good morning Stack Overflow,

Computing a statistic on each column of a single dataframe can be done with the (s)apply function. Is it possible to compute such a statistic on each column of several dataframes using the apply family?

Number of missing values per column (1 dataframe):

dataf <- data.frame(list(a = 1:3, b = c(NA, 3:4)), row.names = c("x","y","z"), stringsAsFactors = FALSE)
sapply(dataf, function(x) {sum(is.na(x))})

I have thought about putting the dataframes in a list, but the statistic is then computed over the elements of the list (i.e. each whole dataframe), whereas I want it computed on the columns of each dataframe. Any ideas?

Have a nice day,

Anthony

guiotan
  • `lapply(list, function(x) sapply(x, function(y) sum(is.na(y))))` might be worth a try – missuse Aug 28 '18 at 09:03
  • @missuse Thank you! I still have lots to learn xD! Have a nice day! – guiotan Aug 28 '18 at 09:07
  • @missuse How would you replace the NA with 0, considering this multiple dataset issue? I have tried your logic but it is not working with the code: `lapply(li, function(dataf) sapply(dataf, function(col) { mutate_all(col, funs(ifelse(is.na(.), 0, .))) }))` – guiotan Aug 28 '18 at 10:15
  • try: `lapply(li, function(x) sapply(x, function(y) ifelse(is.na(y), 0, y)))` or `lapply(li, function(x) mutate_all(x, funs(ifelse(is.na(.), 0, .))))` – missuse Aug 28 '18 at 10:18
  • I would `lapply(li, function(x){x[is.na(x)] <- 0; return(x);})` – Arno Aug 28 '18 at 13:27

1 Answer

In general, it is a good idea to store your dataframes in a list if you want to do similar things with them. For more information, see the excellent answer by @gregor to the question "How do I make a list of data frames?".

The comment by @missuse is correct. Tested on your example:

dataf <- data.frame(list(a = 1:3, b = c(NA, 3:4)), row.names = c("x","y","z"), stringsAsFactors = FALSE)
dataf2 <- data.frame(list(a = 1:3, b = c(NA, 3:4)), row.names = c("x","y","z"), stringsAsFactors = FALSE)

li <- list(dataf,dataf2)

lapply(li, function(x) sapply(x, function(y) sum(is.na(y))))
[[1]]
a b 
0 1 

[[2]]
a b 
0 1 
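
As a possible extension (not part of the tested example above): if the list elements are named, the result is labelled by dataframe, and an outer `sapply` can collect the counts into a single matrix when all dataframes share the same columns. This is only a sketch; the element names `first` and `second`, the helper `count_na`, and the contents of `dataf2` are made up for illustration.

```r
# Example data: dataf is taken from the question, dataf2 is invented here
# so that the two dataframes give different counts.
dataf  <- data.frame(a = 1:3, b = c(NA, 3:4), row.names = c("x", "y", "z"))
dataf2 <- data.frame(a = c(NA, NA, 3), b = 4:6, row.names = c("x", "y", "z"))

# Hypothetical element names; any names would do.
li <- list(first = dataf, second = dataf2)

# Count the missing values in each column of one dataframe.
# vapply() checks that every column yields exactly one integer.
count_na <- function(df) vapply(df, function(col) sum(is.na(col)), integer(1))

# One named vector of counts per dataframe.
na_counts <- lapply(li, count_na)

# Same computation, simplified to a matrix: one row per column name,
# one column per dataframe. This relies on all dataframes having the
# same columns.
na_matrix <- sapply(li, count_na)

print(na_counts)
print(na_matrix)
```

With a named list, individual results can be looked up by name, for example `na_counts$second` or `na_matrix["a", "second"]`.
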
Arno
  • Thank you! I will think about it in the future ;). Have a nice day! – guiotan Aug 28 '18 at 09:23
  • How would you replace the NA with 0, considering this multiple dataset issue? I have tried your logic but it is not working with the code: `lapply(li, function(dataf) sapply(dataf, function(col) { mutate_all(col, funs(ifelse(is.na(.), 0, .))) }))` – guiotan Aug 28 '18 at 10:15
  • If you want to replace all the NAs with zero you can 1) create a function that replaces each NA with the desired value (in this case 0) `fill_na <- function(x){ x[is.na(x)] <- 0; return(x) }` and 2) apply it to all the dataframes in your list: `lapply(li, fill_na)` – Arno Aug 28 '18 at 12:54
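
To put the variants from the comments side by side, here is a minimal base R sketch on the example data. It assumes the goal is to keep each list element a dataframe. The `sapply`/`ifelse` variant simplifies each element to a matrix, whereas assigning into `x[is.na(x)]` preserves the dataframe. (`funs()` has since been deprecated in dplyr, so only base R is used here.) The names `li_filled` and `as_matrices` are made up for illustration.

```r
# Example data from the question, stored twice in a list.
dataf <- data.frame(a = 1:3, b = c(NA, 3:4), row.names = c("x", "y", "z"))
li <- list(dataf, dataf)

# Replace every NA in one dataframe with 0 and return the dataframe.
fill_na <- function(x) {
  x[is.na(x)] <- 0
  x
}

# Apply the replacement to every dataframe in the list.
li_filled <- lapply(li, fill_na)

# Each element is still a dataframe and no longer contains NAs.
print(is.data.frame(li_filled[[1]]))
print(any(is.na(li_filled[[1]])))

# For comparison: sapply() over the columns simplifies the result,
# so each list element becomes a matrix instead of a dataframe.
as_matrices <- lapply(li, function(x) sapply(x, function(y) ifelse(is.na(y), 0, y)))
print(is.matrix(as_matrices[[1]]))
```

Which form is preferable depends on what happens next: the dataframe version keeps row names and per-column types, while the matrix version coerces all columns to one common type.
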