
I am trying to fit a bootstrapped regression model using code from Andy Field's textbook Discovering Statistics Using R.

I am struggling to interpret an error message that I receive when running the boot() function. From reading other forum posts I understand that it is telling me that there is an imbalance in the number of items between two objects, but I don't understand what this means in my context and how I can resolve it.

You can download my data here (a publicly available dataset on Airbnb listings) and find my code and the full error message below. My predictors are a mixture of dummy variables coded as factors and continuous variables. Thanks in advance for any help!

Code:

bootReg <- function(formula, data, i) {
  d <- data[i, ]
  fit <- lm(formula, data = d)
  return(coef(fit))
}

bootResults <- boot(statistic = bootReg, formula = review_scores_rating ~ instant_bookable + cancellation_policy + 
                  host_since_cat + host_location_cat + host_response_time + 
                  host_is_superhost + host_listings_cat + property_type + room_type + 
                  accommodates + bedrooms + beds + price + security_deposit + 
                  cleaning_fee + extra_people + minimum_nights + amenityBreakfast + 
                  amenityAC + amenityElevator + amenityKitchen + amenityHostGreeting + 
                  amenitySmoking + amenityPets + amenityWifi + amenityTV,
                  data = listingsRating, R = 2000)

Error:

Error in t.star[r, ] <- res[[r]] : 
number of items to replace is not a multiple of replacement length
In addition: Warning message:
In doTryCatch(return(expr), name, parentenv, handler) :
restarting interrupted promise evaluation
David Metcalf
  • The order of the arguments is wrong, try `bootReg <- function (data, i, formula)`. – Rui Barradas Oct 27 '18 at 16:53
  • @RuiBarradas thank you for the suggestion, but implementing this eventually yields the same error message as in my original post (albeit without the additional warning message about restarting interrupted promise evaluations) – David Metcalf Oct 27 '18 at 17:25
  • Note the order of arguments of `boot`. The dots argument `...` will become your `formula` argument. So maybe keep the order of the previous args: `boot(data = listingsRating, statistic = bootReg, R = 2000, formula = review_scores_rating ~ etc`. – Rui Barradas Oct 27 '18 at 17:33

2 Answers


The Problem

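The problem is your factor variables. When you run lm() on a resample of your data (which boot::boot() does over and over), you only get coefficients for the factor levels that are present in that resample. The coefficient vectors returned by different draws can therefore have different lengths, and boot() cannot store them as rows of a single matrix, which is what the error in `t.star[r, ] <- res[[r]]` is complaining about. You can reproduce this with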

debug(boot)
set.seed(123)
bootResults <- boot(statistic = bootReg, formula = review_scores_rating ~ instant_bookable + cancellation_policy + 
                        host_since_cat + host_location_cat + host_response_time + 
                        host_is_superhost + host_listings_cat + property_type + room_type + 
                        accommodates + bedrooms + beds + price + security_deposit + 
                        cleaning_fee + extra_people + minimum_nights + amenityBreakfast + 
                        amenityAC + amenityElevator + amenityKitchen + amenityHostGreeting + 
                        amenitySmoking + amenityPets + amenityWifi + amenityTV,
                    data = listingsRating, R = 2)

which will allow you to move through the function call one line at a time. After you run the line

res <- if (ncpus > 1L && (have_mc || have_snow)) {
    if (have_mc) {
        parallel::mclapply(seq_len(RR), fn, mc.cores = ncpus)
    }
    else if (have_snow) {
        list(...)
        if (is.null(cl)) {
            cl <- parallel::makePSOCKcluster(rep("localhost", 
                ncpus))
            if (RNGkind()[1L] == "L'Ecuyer-CMRG") 
                parallel::clusterSetRNGStream(cl)
            res <- parallel::parLapply(cl, seq_len(RR), fn)
            parallel::stopCluster(cl)
            res
        }
        else parallel::parLapply(cl, seq_len(RR), fn)
    }
} else lapply(seq_len(RR), fn)

Then try

setdiff(names(res[[1]]), names(res[[2]]))
# [1] "property_typeBarn"         "property_typeNature lodge"

Two factor levels are present in the first resample but not in the second, so the two coefficient vectors have different lengths. This is what causes your problem.
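
The same mechanism can be seen on a toy data set without boot() at all. This is only a minimal sketch: the data frame `toy`, its columns `y` and `g`, and the helper `coef_on_rows` are hypothetical names made up for illustration, not part of the Airbnb data.

```r
# Hypothetical toy data: factor `g` has one rare level ("c").
toy <- data.frame(
  y = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8, 7.3),
  g = factor(c("a", "a", "a", "b", "b", "b", "c"))
)

# Same shape as a boot() statistic: refit the model on the rows in `i`.
coef_on_rows <- function(data, i, formula) {
  coef(lm(formula, data = data[i, ]))
}

full    <- coef_on_rows(toy, 1:7, y ~ g)  # all three levels present
partial <- coef_on_rows(toy, 1:6, y ~ g)  # a draw that missed level "c"

length(full)                          # 3: (Intercept), gb, gc
length(partial)                       # 2: (Intercept), gb
setdiff(names(full), names(partial))  # "gc"
```

Any bootstrap draw that happens to miss a rare level produces a shorter vector in the same way, which is why the failure may only show up after many replicates.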

The Solution

Use model.matrix() to expand your factors beforehand (following this Stack Overflow post):

df2 <- model.matrix( ~ review_scores_rating + instant_bookable + cancellation_policy + 
                        host_since_cat + host_location_cat + host_response_time + 
                        host_is_superhost + host_listings_cat + property_type + room_type + 
                        accommodates + bedrooms + beds + price + security_deposit + 
                        cleaning_fee + extra_people + minimum_nights + amenityBreakfast + 
                        amenityAC + amenityElevator + amenityKitchen + amenityHostGreeting + 
                        amenitySmoking + amenityPets + amenityWifi + amenityTV - 1, data = listingsRating)
undebug(boot)

set.seed(123)
bootResults <- boot(statistic = bootReg, formula = review_scores_rating ~ .,
                    data = as.data.frame(df2), R = 2)

(Note that throughout I reduce R to 2 just for faster runtime during debugging; set it back to 2000 for the real run.)
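
To see why expanding the factors first helps, here is a small sketch on made-up data. The names `toy`, `expanded`, and `refit` are hypothetical and chosen only for illustration. Once every level has its own 0/1 column, lm() returns one coefficient per column in every draw, and reports NA for a column it cannot estimate instead of dropping it.

```r
# Hypothetical toy data: factor `g` has one rare level ("c").
toy <- data.frame(
  y = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8, 7.3),
  g = factor(c("a", "a", "a", "b", "b", "b", "c"))
)

# Expand the factor once, on the full data. The "- 1" drops model.matrix()'s
# own intercept column, so every level of `g` gets an indicator column.
expanded <- as.data.frame(model.matrix(~ y + g - 1, data = toy))
names(expanded)  # "y" "ga" "gb" "gc"

# Same shape as a boot() statistic: refit the model on the rows in `i`.
refit <- function(data, i, formula) {
  coef(lm(formula, data = data[i, ]))
}

full    <- refit(expanded, 1:7, y ~ .)
partial <- refit(expanded, 1:6, y ~ .)  # level "c" never sampled

length(full) == length(partial)  # TRUE: the vector length is now stable
names(partial)[is.na(partial)]   # columns lm() could not estimate here
```

One consequence worth keeping in mind is that the bootstrap results can then contain NA entries for rare levels, so summaries of those columns may need `na.rm = TRUE`.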

duckmayr
  • Brilliant, many, many thanks for your elaborate answer! I never would have arrived at that solution by myself – David Metcalf Oct 28 '18 at 12:07
  • @DavidMäder No problem, glad it helped! Yeah, `debug()` has helped me get through many extremely hard-to-figure-out problems, it's a good trick to keep in mind. – duckmayr Oct 28 '18 at 12:09

The way you define bootReg and the way you call it are both wrong.
First, you must keep to the argument order that boot() expects of its statistic function, in this case bootReg: the first argument is the dataset and the second is the vector of indices. Any other, optional arguments come after these.

bootReg <- function (data, i, formula){
  d <- data[i, ]
  fit <- lm(formula, data = d)
  return(coef(fit))
}

Second, in the call, the other optional arguments will be passed in the dots ... argument. So once again, keep to the order of arguments as defined in help("boot"), section Usage.

bootResults <- boot(data = iris, statistic = bootReg, R = 2000, 
                    formula = Sepal.Length ~ Sepal.Width)

colMeans(bootResults$t)
#[1]  6.5417719 -0.2276868
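
As a quick sanity check of this calling convention, the sketch below reruns the same iris example with a fixed seed and a small R (both arbitrary choices made here, not part of the original call) and looks at the shape of what boot() returns.

```r
library(boot)

# Statistic in the order described above: data, indices, then extras.
bootReg <- function(data, i, formula) {
  coef(lm(formula, data = data[i, ]))
}

set.seed(1)  # arbitrary seed, only so the run is repeatable
res <- boot(data = iris, statistic = bootReg, R = 50,
            formula = Sepal.Length ~ Sepal.Width)  # passed on through `...`

dim(res$t)  # 50 2: one row per replicate, one column per coefficient
res$t0      # the statistic evaluated on the original, unresampled data
```

Because `res$t` is a matrix with one row per replicate, every call to the statistic has to return a vector of the same length.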
Rui Barradas
  • Thanks for your proposed solution! I updated bootReg and ran boot() with the argument order you specified (data, statistic, R, formula), but it still yields the same error as in the original post. Thinking about it, the error only appears after the code has run for ca. 5-10 minutes, so I don't think there is an issue with the formula syntax, since R is clearly computing something. – David Metcalf Oct 27 '18 at 18:15
  • @DavidMäder Can you post sample data? Please edit the question with the output of `dput(listingsRating)`. Or, if it is too big with the output of `dput(head(listingsRating, 20))`. – Rui Barradas Oct 27 '18 at 18:18
  • @RuiBarradas Thanks for your help, but duckmayr identified the issue so the problem is resolved – David Metcalf Oct 28 '18 at 12:06