0

Is any elegant solution to clean a dataframe from NA values without remove the row or column where the NA is?

Example:

Input dataframe

    C1    C2     C3
 R1  A   <NA>  <NA>
 R2 <NA>  A    <NA>
 R3 <NA> <NA>   A
 R4  B   <NA>  <NA>
 R5 <NA>  B    <NA>
 R6 <NA> <NA>  <NA>
 R7  C   <NA>   B
 R8       C    <NA>
 R9            <NA>
 R10           <NA>
 R11            C

Output dataframe

    C1  C2  C3
R1  A   A   A
R2  B   B   B
R3  C   C   C

For example, here is a messy dataframe (df1) full of NA values

    A       B       C       D       E       F    G    H    I    J    K
1 Healthy    <NA>    <NA>    <NA>    <NA>    <NA> <NA> <NA> <NA> <NA> <NA>
2    <NA> Healthy    <NA>    <NA>    <NA>    <NA> <NA> <NA> <NA> <NA> <NA>
3    <NA>    <NA> Healthy    <NA>    <NA>    <NA> <NA> <NA> <NA> <NA> <NA>
4    <NA>    <NA>    <NA> Healthy    <NA>    <NA> <NA> <NA> <NA> <NA> <NA>
5    <NA>    <NA>    <NA>    <NA> Healthy    <NA> <NA> <NA> <NA> <NA> <NA>
6    <NA>    <NA>    <NA>    <NA>    <NA> Healthy <NA> <NA> <NA> <NA> <NA>

Here is how it should be the dataframe.

   X1        X2        X3      X4        X5        X6        X7      X8      X9       X10       X11
1 Healthy   Healthy   Healthy Healthy   Healthy   Healthy   Healthy Healthy Healthy   Healthy   Healthy
2 Healthy   Healthy   Healthy Healthy   Healthy   Healthy   Healthy Healthy Healthy   Healthy   Healthy
3 Healthy ICDAS_1_2 ICDAS_1_2 Healthy ICDAS_1_2 ICDAS_1_2 ICDAS_1_2 Healthy Healthy ICDAS_1_2 ICDAS_1_2
4 Healthy   Healthy   Healthy Healthy   Healthy   Healthy   Healthy Healthy Healthy   Healthy   Healthy
5 Healthy   Healthy   Healthy Healthy   Healthy   Healthy   Healthy Healthy Healthy   Healthy   Healthy
6 Healthy   Healthy   Healthy Healthy   Healthy   Healthy   Healthy Healthy Healthy   Healthy   Healthy

Note that the cell B-2 from the original dataframe now is in the X2-1. So the main issue here is to find the equivalent to "delete the cell and move all the cells up" function from Calc or Excel

All the answers that I found delete all the row or column where the <NA> value was. The way I managed to do it is (and sorry if this is primitive) was to extract only the valid values to a new dataframe:

First. I create an empty dataframe

library("data.table") # required package
new_dataframe <-  data.frame(matrix("", ncol = 11, nrow = 1400) )

Then, I copy every valid value from the old to the new dataframe

new_dataframe$X1 <- df1$A[!is.na(df2$A)]
new_dataframe$X2 <- df1$B[!is.na(df2$B)]
new_dataframe$X3 <- df1$C[!is.na(df2$C)]

etc

So, my question is: is any more elegant solution to "clean" a dataframe from NA values?

Any help is greatly appreciated.

Mr Lister
  • 45,515
  • 15
  • 108
  • 150
sergiouribe
  • 221
  • 1
  • 3
  • 11
  • 1
    You'll see a selection of functions if you type `?na.fail` – Frank Jan 05 '16 at 18:51
  • It is not clear how the expected output maps to the original data frame provided – Pierre L Jan 05 '16 at 18:54
  • 1
    are you trying to fill in na values? http://stackoverflow.com/questions/8161836/how-do-i-replace-na-values-with-zeros-in-r – MarkAWard Jan 05 '16 at 18:55
  • Thanks for the comments; i'd edited the question, adding the main issue: how to delete cells with NA values and move the rest of the cells up in a dataframe – sergiouribe Jan 05 '16 at 18:56
  • Your example is confusing because in your input example the only non-NA values are `"Healthy"`. However, in your sample output almost all the NAs are replaced with `"Healthy"`, but some are replaced with other values like `"ICDAS_1_2"`. A much better example show sample input and the desired output *for that input*. – Gregor Thomas Jan 05 '16 at 19:00
  • 1
    What you describe, "delete and move all cells up" can be done with `new_data = lapply(old_data, na.omit)`. The result cannot be a data frame unless the resulting data is rectangular. Data frames have few restrictions, but an important one is that all columns must have the same number of rows. If your data has different numbers of non-NA values in each column, then your result will just be a list. – Gregor Thomas Jan 05 '16 at 19:03
  • 1
    The main problem is that a data frame is a list of vectors of equal lengths. R will attempt to recycle shorter length vectors to match the longest in the case that list items are uneven, but you are opening a can of worms. Here is a way `as.data.frame(lapply(mydf, function(x) x[!is.na(x)]))` or as Gregor mentions `as.data.frame(lapply(mydf, na.omit))` – Pierre L Jan 05 '16 at 19:11
  • Strongly related question: [Remove NAs from data frame without deleting entire rows/columns](http://stackoverflow.com/q/22560395/903061). Tempted to mark as duplicate, the only difference is the other one operates on rows, not columns. – Gregor Thomas Jan 05 '16 at 19:13
  • input and output example provided. Previous answers from strongly related questions delete rows, columns or alter the dataframe. Thanks again for your help. – sergiouribe Jan 05 '16 at 19:21
  • In the strongly related question's solutions, if you replace `apply(..., 1, ...)` with `apply(..., 2, ...)` to operate on columns instead of rows, the solutions will probably work. But my solution below is a little more straightforward since data frames are made to be operated on column-wise. – Gregor Thomas Jan 05 '16 at 19:26

1 Answers1

1

If this works for you manually:

new_dataframe$X1 <- df1$A[!is.na(df2$A)]
new_dataframe$X2 <- df1$B[!is.na(df2$B)]
new_dataframe$X3 <- df1$C[!is.na(df2$C)]

then this should work automatically:

new_dataframe = as.data.frame(lapply(df1, na.omit))

should also work (on an arbitrary number of columns). (A more direct translation of your code is what Pierre suggested in the comments: as.data.frame(lapply(mydf, function(x) x[!is.na(x)])).)

Beware that data frames must be rectangular (each column must have the same number of rows), so this will work as you might hope and expect only if each column has the same number of non-missing values. If some rows have fewer non-missing values, they will be recycled to fill out the length of the data frame:

x = data.frame(a = c(1, NA, 2), b = c(2, NA, 3), c = c(NA, "A", NA))
x
#    a  b    c
# 1  1  2 <NA>
# 2 NA NA    A
# 3  2  3 <NA>

as.data.frame(lapply(x, na.omit))
#   a b c
# 1 1 2 A
# 2 2 3 A

A better approach might be to just convert to a list first:

y = lapply(x, na.omit)

You can then see what you've got sapply(y, length) before deciding if you want to coerce to data frame or not.

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294