
This is an extension of my previous post here.

I'm working in R.

In summary, my vectors are HUGE (13 GB), but they're not supposed to be: the original CSV file is a fraction of that size. As you can imagine, 13 GB is a bit more memory than my machine has, never mind what's allocated to R.

The code I'm currently working off of is:

library(sm) # provides sm.ancova()

data1   <- read.csv("stackexample.csv")             # read in dummy data
data1C  <- data1[, 3:13]                            # cut off the ends
SvDvDis <- data1C[c(-3, -4, -6, -7, -9, -10, -11)]  # drop individual columns
attach(SvDvDis)                                     # attach for simplicity's sake
sm.ancova(s, dt, dip, model = "none")               # non-parametric ANCOVA

A dummy-data file can be found on my dropbox.

Is there a way to reduce the memory this function is using, or is there alternative code or a different function that performs the same analysis (non-parametric ANCOVA) in a less memory-intensive way? To be clear, I'm not asking about the stats; I'm asking how to do this in a more memory-efficient way.

Jesse001
  • Thanks for trying to provide your data. Having said that, Dropbox and Google Drive (etc, etc) links aren't allowed for sharing data because they represent security risks and they tend to break over time. – Hack-R Oct 12 '16 at 23:42
  • 1
    fANCOVA via - https://cran.r-project.org/web/packages/fANCOVA/ - maybe? – thelatemail Oct 12 '16 at 23:42
  • @Hack-R thanks for the heads up. Is there a better/preferred way to share dummy data to provide reproducible examples? @thelatemail I'll give it a shot and report back, thanks – Jesse001 Oct 12 '16 at 23:44
  • 1
    @Jesse001 Yea, you can just simulate/create it in your code if it's large. When it's small use `dput()`. I downloaded your file and this one would be straightforward to create within the code. Just FYI `attach()` is a really bad command to use as it leads to all sorts of problems (not in this case, but in general). There's a non-trivial faction that wants it removed from the language all together. – Hack-R Oct 12 '16 at 23:46
  • 2
    There is nothing wrong with your data. There is likely a problem with the sm.ancova function. I have no idea what that does. Try contacting the author of the package it's from. – Hong Ooi Oct 13 '16 at 00:04
  • @thelatemail thanks for the tip on fANCOVA. it seems to be running, but taking hours to do 1 ANCOVA. Do you know if this is normal for that package? – Jesse001 Oct 14 '16 at 15:36
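
To make the fANCOVA suggestion above concrete, here is a minimal sketch. It assumes the package's `T.aov()` test (an ANOVA-type test for equality of nonparametric curves across groups) takes the covariate, response, and grouping vectors in that order; check `?T.aov` for the exact argument names before relying on this.

library(fANCOVA) # nonparametric tests for equality of smooth curves

data1   <- read.csv("stackexample.csv")
data1C  <- data1[, 3:13]
SvDvDis <- data1C[c(-3, -4, -6, -7, -9, -10, -11)]

# Test whether the s-vs-dt relationship differs across dip groups
# (assumed argument order: covariate, response, group; the test is bootstrap-based)
T.aov(SvDvDis$s, SvDvDis$dt, SvDvDis$dip)

Because the test resamples smoothed curves, it can be very slow on the full ~48,000 rows, which may explain the hours-long run time mentioned above; running it on a subsample (as in the answer below) should cut that down considerably.

And on the reproducible-example point: for a small object, `dput()` prints a copy-pastable R representation that can go straight into a question body.

# Self-contained representation of the first 10 rows of the trimmed data
dput(head(SvDvDis, 10))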

1 Answer


This is my suggestion, and it worked fine on my humble laptop. You could supplement it with a means test to make sure the sample is sufficiently reflective of the population.

library(dplyr) # sample_n()
library(sm)    # sm.ancova()

data1 <- read.csv("stackexample.csv") ## read in dummy data

data2 <- sample_n(data1, 10000) # make statistics work for you -- sample the data
sm.ancova(x     = data2$s,
          y     = data2$dt,
          group = data2$dip,
          model = "none") # non-parametric ANCOVA

[resulting sm.ancova plot for the sampled data]

Even with a sample of only 1,000 (re-drawing data2 with sample_n(data1, 1000)) I didn't find any significant differences in the means.

t.test(data1$s, data2$s)
  Welch Two Sample t-test

data:  data1$s and data2$s
t = -1.4469, df = 1017.9, p-value = 0.1482
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -37.657822   5.692622
sample estimates:
mean of x mean of y 
 125.3137  141.2963

With a sample of 5,000:

data2 <- sample_n(data1, 5000) # make statistics work for you -- sample the data
t.test(data1$s, data2$s)
  Welch Two Sample t-test

data:  data1$s and data2$s
t = -1.0653, df = 5513.7, p-value = 0.2868
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -14.736700   4.359704
sample estimates:
mean of x mean of y 
 125.3137  130.5022
t.test(data1$dt, data2$dt)
  Welch Two Sample t-test

data:  data1$dt and data2$dt
t = -0.069479, df = 5507.8, p-value = 0.9446
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -18.39645  17.13709
sample estimates:
mean of x mean of y 
 515.6206  516.2503
t.test(data1$dip, data2$dip)
  Welch Two Sample t-test

data:  data1$dip and data2$dip
t = 1.2044, df = 5536.3, p-value = 0.2285
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.6268062  2.6241395
sample estimates:
mean of x mean of y 
 126.6667  125.6680

And of course, you can use more/different statistics to validate your sample, depending on how far you want to take it. You could also estimate a power curve beforehand to determine the sample size.
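
On the power-curve point, base R's `power.t.test()` is one quick way to choose a sample size up front. A minimal sketch, where the delta of 10 is a placeholder for the smallest difference in `s` you would care about (it is not taken from the real data):

# Required n per group to detect a mean shift of `delta` with 80% power at alpha = 0.05
power.t.test(delta     = 10,          # smallest meaningful difference in s (placeholder)
             sd        = sd(data1$s), # observed spread of s in the full data
             sig.level = 0.05,
             power     = 0.80)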

With a sample of 10,000 it took about 3 minutes to complete on my laptop. With a sample of 1,000 it finished instantly.

Hack-R
  • this will reduce the memory needed by a vector in R? – user5359531 Oct 13 '16 at 00:02
  • @user5359531 Absolutely! By at least 10-fold. It's making every aspect of the problem smaller by sampling. I got the same error as OP with the full data and no error with any reasonable sample size. I wouldn't have been able to get that result plot otherwise (though it's not too pretty anyway). A quick size check along these lines is sketched after these comments. – Hack-R Oct 13 '16 at 00:03
  • that is not reducing the memory requirement of a vector, that is simply making a different smaller vector – user5359531 Oct 13 '16 at 00:06
  • @user5359531 ?? Uh, no it absolutely reduces the size of the problematic vector as well and from a Data Science / statistics and Tidy data perspective this is the best practice. – Hack-R Oct 13 '16 at 00:08
  • sounds like an [XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) the solution offered by @Hack-R solves your actual main problem (X). But you are looking for a solution to Y which is not the correct way to approach X – dww Oct 13 '16 at 00:10
  • 1
    @Hack-R I'm comparing the original to the subsampled dataset now, running off 10,000, and I'm seeing some pretty striking differences. Am I correct in assuming (unfamiliar with sample_n) that what is happening is it's making a new data frame of 10000 observations, instead of the full 48,000? So it's not making it more memory efficient, just running it on a smaller piece of data which therefore requires less memory? right? – Jesse001 Oct 13 '16 at 00:14
  • @dww could you please clarify your point? What the data actually are is 1000 iterations of multiple modeled scenarios. There are a large number of scenarios, meaning the size of the data frame gets large very quickly. – Jesse001 Oct 13 '16 at 18:58
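
For anyone weighing the exchange above: the sampling approach shrinks the object that `sm.ancova()` receives rather than the per-element cost of an R vector, and the difference is easy to confirm directly (object names as in the answer's code):

format(object.size(data1), units = "Mb") # full data frame
format(object.size(data2), units = "Mb") # 10,000-row sample -- roughly 10000/48000 of the above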