Can I use the apply family to get a stat on each column of many dataframes

Question

Good morning Stack Overflow,

Statistics on the columns of a single dataframe can be computed with the (s)apply function. Is it possible to compute the same per-column statistics for each of several dataframes using the apply family?

Number of missing values per column (1 dataframe):

dataf <- data.frame(list(a = 1:3, b = c(NA, 3:4)), row.names = c("x","y","z"), stringsAsFactors = FALSE)
sapply(dataf, function(x) {sum(is.na(x))})

I have thought about making a list of dataframes, but the statistic is then computed on the elements of the list (i.e. on each dataframe as a whole), whereas I want it calculated on the columns. Any ideas?

Have a nice day,

Anthony

`lapply(list, function(x) sapply(x, function(y) sum(is.na(y))))` might be worth a try — missuse, Aug 28 '18 at 09:03
@missuse Thank you! I still have lots to learn xD! Have a nice day! — guiotan, Aug 28 '18 at 09:07
@missuse How would you replace the NA with 0, considering this multiple dataset issue? I have tried your logic but it is not working with the code: `lapply(li, function(dataf) sapply(dataf, function(col) { mutate_all(col, funs(ifelse(is.na(.), 0, .))) }))` — guiotan, Aug 28 '18 at 10:15
try: `lapply(li, function(x) sapply(x, function(y) ifelse(is.na(y), 0, y)))` or `lapply(li, function(x) mutate_all(x, funs(ifelse(is.na(.), 0, .))))` — missuse, Aug 28 '18 at 10:18
I would `lapply(li, function(x){x[is.na(x)] <- 0; return(x);})` — Arno, Aug 28 '18 at 13:27
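The replacement variants suggested in these comments do not return the same kind of object, which may matter for later steps. The original attempt most likely failed because `mutate_all` expects a data frame, while the inner `sapply` hands it one column vector at a time. Below is a minimal base R sketch of the difference between the `sapply`/`ifelse` variant and the index-assignment variant; the list `li` is rebuilt here from the question's example data, so treat the setup as an assumption.

```r
# Example data mirroring the question; `li` is rebuilt here for illustration
dataf <- data.frame(a = 1:3, b = c(NA, 3:4), row.names = c("x", "y", "z"))
li <- list(dataf, dataf)

# Variant 1: sapply over the columns. Each data frame comes back as a
# matrix, so the data frame class and the row names are lost.
as_matrix <- lapply(li, function(x) sapply(x, function(y) ifelse(is.na(y), 0, y)))

# Variant 2: logical-index assignment. Each element stays a data frame.
as_df <- lapply(li, function(x) { x[is.na(x)] <- 0; x })

is.matrix(as_matrix[[1]])
is.data.frame(as_df[[1]])
```

If the result has to remain a list of dataframes, the second variant is probably the safer choice.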

score 1 · Accepted Answer · answered Aug 28 '18 at 09:14

In general, it is a good idea to store your dataframes in a list if you want to do similar things with them. For more information, see the excellent answer by @gregor to the question "How do I make a list of data frames?".

The comment of @missuse is correct. Tested on your example:

dataf <- data.frame(list(a = 1:3, b = c(NA, 3:4)), row.names = c("x","y","z"), stringsAsFactors = FALSE)
dataf2 <- data.frame(list(a = 1:3, b = c(NA, 3:4)), row.names = c("x","y","z"), stringsAsFactors = FALSE)

li <- list(dataf,dataf2)

> lapply(li, function(x) sapply(x, function(y) sum(is.na(y))))
[[1]]
a b 
0 1 

[[2]]
a b 
0 1

Arno

Thank you! I will think about it in the future ;). Have a nice day! – guiotan Aug 28 '18 at 09:23
How would you replace the NA with 0, considering this multiple dataset issue? I have tried your logic but it is not working with the code: `lapply(li, function(dataf) sapply(dataf, function(col) { mutate_all(col, funs(ifelse(is.na(.), 0, .))) }))` – guiotan Aug 28 '18 at 10:15
If you want to replace all the NAs with zero, you can 1) create a function that replaces NA with the desired value (in this case 0): `fill_na <- function(x){ x[is.na(x)] <- 0; return(x) }` and 2) apply it to all the dataframes in your list: `lapply(li, fill_na)` – Arno Aug 28 '18 at 12:54
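Written out over several lines, the two steps from this comment might look like the sketch below. The `value` argument is an addition for illustration and not part of the original comment, and the example list is rebuilt from the question's data. The last line reuses the NA count from the answer to check that nothing is left to replace.

```r
# Replace every NA in a data frame with `value`
# (`value` is a hypothetical extra argument; the comment hard-codes 0)
fill_na <- function(x, value = 0) {
  x[is.na(x)] <- value
  x
}

# Example data from the question
dataf  <- data.frame(a = 1:3, b = c(NA, 3:4), row.names = c("x", "y", "z"))
dataf2 <- data.frame(a = 1:3, b = c(NA, 3:4), row.names = c("x", "y", "z"))
li <- list(dataf, dataf2)

# lapply returns a new list; the original `li` is not modified
li_filled <- lapply(li, fill_na)

# Check: the per-column NA counts should now all be zero
lapply(li_filled, function(x) sapply(x, function(y) sum(is.na(y))))
```

Note that the result has to be assigned (here to `li_filled`), since `lapply` does not change `li` in place.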