0

I'm having difficulty creating a loop to identify missing values in a column. I'm using this loop to add columns into a smaller dataset, impute with mice, then merge back. I can't write manually within my function because the output data will be a smaller subset based on area code and each area code has different missing values in different columns.

For reference:

    library(mice)
    ListingPricePrep<-function(Zip,dataset){
    City<-subset.data.frame(dataset,dataset$ZipCode==Zip)

    #Fault Area#

    t1<-mice(City[,c(7,12:13,15:16,21:22,24:25,27:28,30:31)],m=1,method = "norm")
    t2<-mice(City[,9:10])

    df1<-mice::complete(t1)
    df2<-mice::complete(t2)

    City<-cbind.data.frame(City[,c(1:3,5,6,8,11,14,20,23,26,29)],df1,df2)
    City$LPB<-ifelse(City$`Median Listing Price`>mean(City$`Median Listing Price`)+sd(City$`Median Listing Price`),1,0)
    City$LPMMB<-ifelse(City$`Median Listing Price M/M`>0,1,0)
    City$LPYYB<-ifelse(City$`Median Listing Price Y/Y`>0,1,0)
    City$ALCMMB<-ifelse(City$`Active Listing Count M/M`>0,1,0)
    City$ALCYYB<-ifelse(City$`Active Listing Count Y/Y`>0,1,0)
    City$DOMMMB<-ifelse(City$`Days on Market M/M`>0,1,0)
    City$DOMYYB<-ifelse(City$`Days on Market Y/Y`>0,1,0)
    City$NLCMMB<-ifelse(City$`New Listing Count M/M`>0,1,0)
    City$NLCYYB<-ifelse(City$`New Listing Count Y/Y`>0,1,0)
    City$ALPMMB<-ifelse(City$`Avg Listing Price M/M`>0,1,0)
    City$APLYYB<-ifelse(City$`Avg Listing Price Y/Y`>0,1,0)
    City$TLCMMB<-ifelse(City$`Total Listing Count M/M`>0,1,0)
    City$TLCYYB<-ifelse(City$`Total Listing Count Y/Y`>0,1,0)
    City$MonthName<-month(City$Month)

    fits <- list(normal = fitdistr(City$`Median Listing Price`, "normal"),
         weibull = fitdistr(City$`Median Listing Price`, "weibull"),
         lognormal= fitdistr(City$`Median Listing Price`,"lognormal"),
         logistic= fitdistr(City$`Median Listing Price`,"logistic"),
         cauchy= fitdistr(City$`Median Listing Price`,"cauchy"),
         poisson= fitdistr(City$`Median Listing Price`,"poisson"),
         t= fitdistr(City$`Median Listing Price`,"t")
         )

    print(sort(sapply(fits,function(i) i$loglik),decreasing = T))
    return(City)

}

Some outputs will work as they have missing values in the specified columns above, and others will return: Error in mice(City[, c(7, 12:13, 15:16, 21:22, 24:25, 27:28, 30:31)], : No missing values found

This is a fun project for me and I can force it to work by picking and choosing the right cities, but I want to practice making functional...functions.

So far, I'm pretty off-base with the loop shown below:

    for (i in 1:length(dataset)) for (j in ncol(dataset)){
      dat<-names(dataset[is.na(dataset[i,j])==T])
    }

p.s. feel free to comment with suggestions on cleaning up code or optimization you see fit. Everyone's got a different style.

Edit

Sample Data Picture

I just want to impute things like Median Listing Price Y/Y which are sometimes missing and sometimes not, depending on the month and area code.

M--
  • 25,431
  • 8
  • 61
  • 93
ASavage
  • 33
  • 1
  • 8

2 Answers2

1

What i notice is that your dat does not change. So even if this worked, it would keep only the last value. Try this library(data.table) dt <- as.data.table(dataset) dt[,lapply(.SD,function(x){sum(is.na(x))>0})] # this will give you which columns have NA dt[,is.na(dt),with=F]

quant
  • 4,062
  • 5
  • 29
  • 70
  • @ycw Editted it. Thanks :) – quant Jun 28 '17 at 14:01
  • I'm getting > dt[,is.na(dt),with=F] Error in `[.data.table`(dt, , is.na(dt), with = F) : Item 1 of j is 185 which is outside the column number range [1,ncol=34] – ASavage Jun 28 '17 at 14:34
1
dt <- as.data.table(dataset)
d1<-data.frame(dt[,lapply(.SD,function(x){sum(is.na(x))>0})]) # this will give you which columns have NA
d2<-(which(d1[1,]==TRUE))
d3<-dataset[,d2]

Thank you @quant for the kick start idea. I'll create a second subset with ==False and merge the results back together.

ASavage
  • 33
  • 1
  • 8