0

I am trying to stop using for loops when I code but I have a bit of a problem representing a simple operation.

Let's say I am trying to do simple nearest-neighbour estimation on a dataset for a company that owns several restaurants. I have three features: City, Store, Month and one target function Sales. City,Store and Month are all represented with numbers: City takes values between 1-100, Store takes values between 1-50 and Month between 1-12.

Now, I want to replace this for-loop with an apply function:

for (c in 1:100){
 for (s in 1:50){
  for (m in 1:12){
   dat1$Sales[dat1$City==c & dat1$Store==s & dat1$Month==m & is.na(dat1$Sales)] <-
    mean(dat1$Sales[dat1$City==c & dat1$Store==s & dat1$Month==m & !is.na(dat1$Sales)])
  }
 }
}

What is the complexity of this apply function?

Many thanks!

thecheech
  • 2,041
  • 3
  • 18
  • 25
  • Using a data.table, `dat1[,Sales:=Sales[!is.na(Sales)],by='c,s,m']` – Frank Apr 11 '14 at 18:45
  • 2
    Please review [how to make a great reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). A `dput()` of some of the data frame would be cool to help folks help you. – hrbrmstr Apr 11 '14 at 18:48
  • What is this doing? At first glance it just looks like you are removing the rows where sales is NA. Can you show us a little more of the data.frame? – John Paul Apr 11 '14 at 19:02
  • Sorry I wasn't clear - I updated the question. I'm replacing each missing item with average of similar items. I didn't include snapshots of the dataset because it seems very easy and trivial. But I'm guessing it's not... – thecheech Apr 11 '14 at 19:06
  • 2
    Ah, I didn't notice the `mean`. In that case, `dat1[,Sales:=mean(Sales[!is.na(Sales)]),by='c,s,m']` – Frank Apr 11 '14 at 19:30
  • 1
    @Frank will that replace all of the `Sales` values with the mean or only those `Sales` which are `NA` ? – John Paul Apr 11 '14 at 19:38
  • @JohnPaul Oh you are right, hmm. One probably inefficient way would be `dat1[,Sales:={mm=mean(Sales,na.rm=TRUE); ifelse(is.na(Sales),mm,Sales)},by='c,s,m']` I think – Frank Apr 11 '14 at 19:47

2 Answers2

3

Try using aggregate. It has a formula like interface that makes it easy to get the results of a function applied on parts of a data.frame. Then just assign the result to the place in dat1 that needs it.

TempOut<- aggregate(Sales~City+Store+Month, FUN=mean,data=dat1)

dat1$Sales[is.na(dat1$Sales),]<-TempOut[TempOut$City==[dat1[is.na(dat1$Sales),]$City 
& TempOut$Store==[dat1[is.na(dat1$Sales),]$Store & TempOut$Month==
[dat1[is.na(dat1$Sales),]$Month,]$Sales

You could combine the creation of TempOut and assignment to dat1$Sales into one line, but that would have made this even harder to read. I don't have your data so I can't test this - but this should get you on the right track, even if there is a typo in there.

John Paul
  • 12,196
  • 6
  • 55
  • 75
3

Here's a data.table way:

require(data.table)
setDT(dat1)

dat1[, Sales:={
  m=mean(Sales,na.rm=TRUE)
  replace(Sales, is.na(Sales), m)
},by=.(City, Store, Month)]

It would be nice to have something like Sales[is.na(Sales)]:=..., but this is just a feature request right now. Here is a similar question.

Community
  • 1
  • 1
Frank
  • 66,179
  • 8
  • 96
  • 180