
I want to run a single function (namely, computing the Gini coefficient with the DescTools package) on a vector of length 40k.

set.seed(42)
my_vec <- sample(1:100000, 40000, replace = TRUE)

# compute the Gini coefficient with a 99% confidence interval
DescTools::Gini(my_vec, conf.level = 0.99)

Computing just the Gini coefficient without the confidence interval finishes almost instantly, but requesting the confidence interval runs out of memory on my machine (64-bit R, 8 GB RAM) and returns

Error: vector memory exhausted (limit reached?)

To solve this, I looked into these options:

  • Increase the memory available to R. I have not found an option for that on the Mac (memory.limit() seems to be Windows-only).
  • Run the function in parallel using the parallel package.

I'm struggling with the latter because the function does not iterate over multiple columns, so I would not expect parallelization to help:

parallel::mclapply(my_vec, function(x) DescTools::Gini(x, unbiased = TRUE, conf.level = 0.99), mc.cores = 3) # does not work

Is there a way to avoid the memory issue, and if parallelization is a solution, how could I implement it for the one vector? Thanks a lot!

ben_aaron
  • I imagine that the main memory usage comes from storing the bootstrapped samples for the confidence interval calculation. You could try to reduce the number of bootstrap samples by setting `R = 100` (default is `R = 1000`); whether this will be sufficient is a different question. – Maurits Evers Jul 13 '18 at 08:15

1 Answer


RevoScaleR includes implementations of the Lorenz curve and the Gini index that process the data in chunks, so they work regardless of the size of the vector.

library(RevoScaleR)
set.seed(42)
my_vec <- data.frame(V1 = sample(1:100000, 40000, replace = TRUE))

# Compute Lorenz
lorenzOut <- rxLorenz(orderVarName = "V1", data = my_vec)

# Compute the Gini Coefficient
giniCoef <- rxGini(lorenzOut)
giniCoef
0.335597

A 99% bootstrap confidence interval:

boot <- replicate(1000, {
  idx <- sample.int(nrow(my_vec), nrow(my_vec), replace = TRUE)  # resample rows with replacement
  rxGini(rxLorenz(orderVarName = "V1", data = my_vec[idx, , drop = FALSE], reportProgress = 0))
})

summary(boot)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.3315  0.3347  0.3356  0.3356  0.3364  0.3396 

quantile(boot, probs = c(0.005, 0.995))
     0.5%     99.5% 
0.3324822 0.3389219 
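
If RevoScaleR is not available, the same percentile bootstrap can be sketched in base R. This is only a minimal sketch: `gini()` and `boot_gini_ci()` are hypothetical helper names introduced here, not functions from DescTools or RevoScaleR, and the exact formula is computed on the sorted vector, so the values need not match the `rxGini` output above digit for digit. Only one resample is held in memory at a time, and the input is converted to double because integer sums of this size sit close to the integer limit.

```r
# Hypothetical helper: Gini coefficient of a non-negative numeric vector.
# as.numeric() avoids integer overflow when summing large integer vectors.
gini <- function(x, unbiased = FALSE) {
  x <- sort(as.numeric(x))
  n <- length(x)
  g <- sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
  if (unbiased) g * n / (n - 1) else g
}

# Hypothetical helper: percentile bootstrap confidence interval.
# Each replicate draws one resample, computes its Gini, and discards the resample.
boot_gini_ci <- function(x, R = 1000, conf.level = 0.99) {
  boot <- replicate(R, gini(sample(x, length(x), replace = TRUE)))
  alpha <- 1 - conf.level
  quantile(boot, probs = c(alpha / 2, 1 - alpha / 2))
}

set.seed(42)
my_vec <- sample(1:100000, 40000, replace = TRUE)

gini(my_vec)                 # point estimate
ci <- boot_gini_ci(my_vec)   # 99% percentile interval
ci
```

A percentile interval is cruder than the `bca` interval that `boot::boot.ci` can produce, but it only needs the vector of bootstrap estimates, which keeps memory use small.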
  • I'm curious: Can `rxGini` calculate bootstrap-based CIs as well? – Maurits Evers Jul 13 '18 at 08:20
  • Not directly, but it is easy to program yourself. The problem with `DescTools::Gini` is that it returns NAs due to overflow, so even if you run the bootstrap in chunks you will not get the desired result with big data. – Juan Antonio Roldán Díaz Jul 13 '18 at 08:42
  • Thanks a lot. I wasn't aware of the RevoScaleR package. An even simpler workaround is your approach of doing the bootstrapping yourself via `replicate(...)`. It might not work with much larger data sets, though, I agree. – ben_aaron Jul 18 '18 at 15:30