-17

Given a data with missing values, imputation is a process where the missing values are substituted with some values. The goal is to ignore the rows with missing values, denoted with NAs. Such row could be seen as a component of the data hence the process called item imputation.

Input

df1 <- data.frame(ID=c(1,2,5,6),V1=c(7,77,777,NA))
df2 <- data.frame(ID=c(1,3,5,6),V2=c(6,66,666,6666))
df3 <- data.frame(ID=c(1,3,5,6),V3=c(9,NA,999,9999))

or alternatively in CSV format where misings values are marked with NAs

data.csv      data2.csv        data3.csv

ID V1         ID V2            ID V3
1  7          1  6             1  9
2  77         2  NA            2  NA
3  NA         3  66            3  NA
4  NA         4  NA            4  NA
5  777        5  666           5  999
6  NA         6  6666          6  9999

Output

Expected result is

ID V1   V2   V3
1  7    6    9
5  777  666  999

where we we wanted just lines without any NA value.

How to merge the input data with columns V1, V2, V3 and a common column ID with no NA on a row?


Example solution with SQLDF to merge the columns with common ID and no NA

library(sqldf)
# Read in the data: with CSV, you can use read.csv or fread from data.table
df1 <- data.frame(ID=c(1,2,5,6),V1=c(7,77,777,NA))
df2 <- data.frame(ID=c(1,3,5,6),V2=c(6,66,666,6666))
df3 <- data.frame(ID=c(1,3,5,6),V3=c(9,NA,999,9999))
#
sqldf("SELECT a.ID, a.V1, b.V2, c.V3 FROM df1 a, df2 b, df3 c WHERE a.ID=b.ID AND b.ID=c.ID AND V1!='NA'")

resulting to

   ID   V1  V2  V3
1   1    7   6   9
2   5  777 666 999
hhh
  • 50,788
  • 62
  • 179
  • 282
  • 9
    Wow, this has to be one of the worst-asked questions I've encountered. – Joshua Ulrich Aug 10 '11 at 16:10
  • What with ID = 1 ? What are you trying to tell us? Where is your reproducible code? – Joris Meys Aug 10 '11 at 16:12
  • I presume that the expected result is but one line of the expected result? What happened to `ID == 1`? Should `6` and `3` be in the expected result but with an `NA` where the data are missing? Code is great, but don't expect us to decipher the intention of your Q from it alone! – Gavin Simpson Aug 10 '11 at 16:14
  • 1
    @hhh: Really? By what rule should ID 1 be excluded since it belongs in all 3 data.frames just like ID 5 does? – Ari B. Friedman Aug 10 '11 at 16:23
  • @hhh the data shown and your insistence on expected output containing only ID 5 are inconsistent. Can you resolve please? – Gavin Simpson Aug 10 '11 at 16:27
  • No, why is ID 1 excluded - it is present in all 3 csv files – Gavin Simpson Aug 10 '11 at 16:28
  • 1
    Yes, that is better. Note the data you show aren't CSV and it isn't easy for us to work with these data as they aren't reproducible or R code/output. – Gavin Simpson Aug 10 '11 at 16:44

2 Answers2

3

Here is a base R only version, that doesn't care how many merges there are. Assumes the data frames are in list l - see edit to Q for the example data in that format:

for(i in seq_along(l[-1])) {
    if(i == 1) {
        m <- merge(l[[i]], l[[i+1]])
    } else {
        m <- merge(m, l[[i+1]])
    }
}
m <- m[!apply(is.na(m), 1, any), ]

which gives the desired output

> m
  ID  V1  V2  V3
1  1   7   6   9
2  5 777 666 999

To read the data in, something like this should work

l <- lapply(list.files(pattern = glob2rx("data*.csv")), read.table, 
            header = TRUE)

or if they really are CSV

l <- lapply(list.files(pattern = glob2rx("data*.csv")), read.csv)

then you can use the code above to process.

Gavin Simpson
  • 170,508
  • 25
  • 396
  • 453
2

Out of respect for the artform of haiku on fine display in this question, I'm going to provide the following answer/wild-guess:

library(reshape)
dats <- lapply( dir(), read.csv )
mgd <- merge_recurse( dats, by="ID" )
na.sel <- apply( mgd, 1, function(x) any(is.na(x)) )
mgd <- mgd[!na.sel,]

Note this assumes you actually want ID==1.

Thanks to @Joris for the merge_recurse tip.

Ari B. Friedman
  • 71,271
  • 35
  • 175
  • 235
  • tss tss tss... merge_recurse? : http://stackoverflow.com/questions/6942662/combining-tab-delim-files-into-a-single-using-r/6942859#6942859 – Joris Meys Aug 10 '11 at 16:24
  • 1
    Ooh, I likes! Learn something new every day. – Ari B. Friedman Aug 10 '11 at 16:28
  • 1
    @hhh Did this solve your problem? I'd check myself, but without a reproducible example it's impossible. Might I suggest reading http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example ? With you're rep I'm surprised this is necessary. – Ari B. Friedman Aug 10 '11 at 16:37