
The goal of the code below is to perform recursive and iterative analysis on a data set that has 400 columns and 6000 rows. It takes two columns at a time and performs analysis on them, then moves on to all possible combinations (with 400 columns, that is 400 × 399 = 159,600 ordered pairs).

Small subset of the large data set being used:

  data1       data2       data3      data4
-0.710003   -0.714271   -0.709946   -0.713645
-0.710458   -0.715011   -0.710117   -0.714157
-0.71071    -0.714048   -0.710235   -0.713515
-0.710255   -0.713991   -0.709722   -0.71397
-0.710585   -0.714491   -0.710223   -0.713885
-0.710414   -0.714092   -0.710166   -0.71434
-0.711255   -0.714116   -0.70945    -0.714173
-0.71097    -0.714059   -0.70928    -0.714059
-0.710343   -0.714576   -0.709338   -0.713644

Code using apply():

# Globals the function relies on (must exist before the first call)
currentColumn <- 1
linearData <- data.frame(id = 1:nrow(Data))

# Function
analysisFunc <- function () {

    # Fetch next data to be compared
    nextColumn <<- currentColumn + 1

    while (nextColumn <= ncol(Data)){

        # Fetch the two columns on which to perform analysis
        c1 <- Data[, currentColumn]
        c2 <- Data[, nextColumn]

        # Create linear model
        linearModel <- lm(c1 ~ c2)

        # Capture model data from summary
        modelData <- summary(linearModel)

        # Residuals
        residualData <- t(t(modelData$residuals))

        # Keep on appending data
        linearData <<- cbind(linearData, residualData)

        # Fetch next column
        nextColumn <<- nextColumn + 1

    }

    # Increment the counter
    currentColumn <<- currentColumn + 1

}

# Call the function once per column via apply()
apply(Data, 2, function(x) analysisFunc())

I thought that using apply() instead of loops would help me optimize the code. However, it seems to have no major effect; the run time is more than two hours.

Does anyone think I am going wrong in how apply() has been used? Is having a while() loop within an apply() call a bad idea? Is there any other way I can improve this code?

This is the first time I am working with functional programming. Please let me know your suggestions, thanks.

Chetan Arvind Patil
    `data.table` might be worth looking into. – BLT Jul 03 '17 at 22:44
  • When you are using the apply function, have you re-wrote the analysisFunc such that it will work on a single column? – kangaroo_cliff Jul 03 '17 at 22:48
  • It would be great if you provided a small data set, and the 'code for analysis' bit, as there could be many ways to improve your performance if we knew what the overall goal is. – SymbolixAU Jul 03 '17 at 22:54
  • @DiscoSuperfly - If I do that, will `apply()` fetch the `nextColumn` on its own? Is that what you are suggesting? The code shown is the only logic, I have tried so far. – Chetan Arvind Patil Jul 03 '17 at 22:57
    Looking at your [previous questions](https://stackoverflow.com/users/8094899/chetan-arvind-patil), they all appear to be around 'improving loops'. I think you're approaching this the wrong way. I think you'd do better to detail what your overall goal is, and an example of the data you're working with. – SymbolixAU Jul 03 '17 at 23:04
  • @SymbolixAU - I have updated the question, please check. – Chetan Arvind Patil Jul 03 '17 at 23:09
  • Are you trying to perform a `lm` on every combination of 2 columns, for every row? – SymbolixAU Jul 03 '17 at 23:10
    @SymbolixAU - Yes. I take two columns at a time, then do `lm()` on it. This gives me `$residuals` for each of the rows (two data points from each column), which I want to capture. This I then repeat for each of the possible column combination in the data set. – Chetan Arvind Patil Jul 03 '17 at 23:15
    The problem might be because of the loop within the function which causes bottleneck. Thus, even if you use apply, it still goes through a loop. Why not make a list of all possible combinations of columns (variables), then use that list in an apply construct (maybe sapplyor lapply). – din Jul 04 '17 at 00:54
  • Might come from available memory. It takes less than 20 minutes on my desktop with 16GB. You can speed up using `lm.fit`or `.lm.fit` that are faster than `lm`. You should avoid using `<<-`. Parfait's code is good example of how to program with R – HubertL Jul 04 '17 at 01:16
  • @din - Thanks. I guess your suggested method will also work, as does Parfait's answer. – Chetan Arvind Patil Jul 04 '17 at 05:11
  • @HubertL - Memory can be the issue, I have 8GB only. Did you expand the data to 6000x400? I gave only small data set as an example above. I too wanted to avoid `<<-` , but still getting hang of R programming. – Chetan Arvind Patil Jul 04 '17 at 05:14
    @ChetanArvindPatil All good. It actually has the same logic with Parfait answer -- which essentially get rids of the loop. In addition, you might want to consider using microsoft R as well. I found it faster especially when working bayesian sampling. – din Jul 04 '17 at 05:20
    I used `Data <- sapply(1:400, rnorm, n=6000)` to generate the dataset that I then used with your code (declaring `currentColumn <- 1;linearData <- data.frame(id=1:6000)` ). And it takes 6 minutes only with `linearModel <- lm.fit(as.matrix(c2, nrow=6000), as.matrix(c1, nrow=6000));residualData <- linearModel$residuals` – HubertL Jul 04 '17 at 18:55
  • @HubertL - Thanks. I need to dig `lm.fit()` more. – Chetan Arvind Patil Jul 05 '17 at 14:59
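HubertL's `lm.fit` suggestion from the comments can be sketched roughly as follows. This is an illustrative sketch only: the helper name `pairResiduals` and the random test data are not from the thread.

```r
# Sketch: swap lm() for the lower-level lm.fit(), which skips formula
# parsing and summary() overhead and exposes residuals directly.
set.seed(42)
Data <- as.data.frame(sapply(1:4, function(i) rnorm(10)))

pairResiduals <- function(i, j) {
  # lm.fit() takes a design matrix; cbind(1, ...) adds the intercept column
  fit <- lm.fit(cbind(1, Data[[j]]), Data[[i]])
  fit$residuals
}

res <- pairResiduals(1, 2)
length(res)   # one residual per row of Data
```

`lm.fit()` (and the even leaner `.lm.fit()`) avoid the model-frame and formula machinery, which is where much of `lm()`'s per-call cost goes when it is invoked tens of thousands of times.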

1 Answer


Consider an expand.grid of column names, then mapply, the multiple-input member of the apply family, where you pass two or more vectors/lists and run a function across the inputs elementwise. With this approach you avoid growing vectors inside a loop and running an inner while loop:

Data

Data <- read.table(text="  data1       data2       data3      data4
-0.710003   -0.714271   -0.709946   -0.713645
-0.710458   -0.715011   -0.710117   -0.714157
-0.71071    -0.714048   -0.710235   -0.713515
-0.710255   -0.713991   -0.709722   -0.71397
-0.710585   -0.714491   -0.710223   -0.713885
-0.710414   -0.714092   -0.710166   -0.71434
-0.711255   -0.714116   -0.70945    -0.714173
-0.71097    -0.714059   -0.70928    -0.714059
-0.710343   -0.714576   -0.709338   -0.713644", header=TRUE)

Process

# Data frame of all combinations excluding same columns 
modelcols <- subset(expand.grid(c1=names(Data), c2=names(Data), 
                    stringsAsFactors = FALSE), c1!=c2)

# Function
analysisFunc <- function(x,y) {        
      # Fetch the two columns on which to perform analysis
      c1 <- Data[[x]]
      c2 <- Data[[y]]

      # Create linear model
      linearModel <- lm(c1 ~ c2)

      # Capture model data from summary
      modelData <- summary(linearModel)

      # Residuals (the last expression is the function's return value)
      modelData$residuals
}

# Apply function to return matrix of residuals
linearData <- mapply(analysisFunc, modelcols$c1, modelcols$c2)
# re-naming matrix columns
colnames(linearData) <- paste(modelcols$c1, modelcols$c2, sep="_")

Output

    data2_data1   data3_data1   data4_data1   data1_data2   data3_data2   data4_data2
1  1.440828e-04  8.629813e-05  1.514109e-04  5.583917e-04 -0.0001205821  2.866488e-04
2 -6.949384e-04 -2.508770e-04 -2.487813e-04 -1.005367e-04 -0.0001263202 -2.145225e-04
3  2.132192e-04 -4.609125e-04  4.551430e-04 -8.715424e-05 -0.0004593840  4.133856e-04
4  3.692403e-04  2.182627e-04 -1.116648e-04  3.835538e-04  0.0000408864 -4.244855e-05
5 -2.025772e-04 -4.032600e-04  5.442655e-05 -8.423568e-05 -0.0003484501  4.986815e-05
6  2.336373e-04 -2.838073e-04 -4.425935e-04  1.967203e-04 -0.0003805576 -4.109706e-04
7  2.661145e-05  1.250425e-04 -6.893342e-05 -6.508936e-04  0.0003408023 -2.436194e-04
8  1.456357e-04  3.991303e-04 -2.496687e-05 -3.501856e-04  0.0004980726 -1.304535e-04
9 -2.349110e-04  5.701233e-04  2.359596e-04  1.343401e-04  0.0005555326  2.921120e-04
    data1_data3   data2_data3   data4_data3   data1_data4   data2_data4   data3_data4
1  5.121547e-04  4.313395e-05  2.829814e-04  4.232081e-04  1.795365e-05 -9.584175e-05
2 -1.649379e-06 -6.684696e-04 -2.349827e-04  1.975728e-04 -7.112598e-04 -3.014160e-04
3 -2.942277e-04  3.141257e-04  4.029018e-04 -3.420290e-04  2.382149e-04 -3.760631e-04
4  3.371847e-04  2.859362e-04 -3.420612e-05  3.168009e-04  3.048006e-04  1.062117e-04
5 -1.651011e-04 -1.308671e-04  3.332034e-05 -5.127719e-05 -1.969902e-04 -3.890484e-04
6  2.550032e-05  2.586674e-04 -4.196917e-04  3.235528e-04  2.115955e-04 -3.627735e-04
7 -5.692790e-04  1.157675e-04 -2.277195e-04 -5.922595e-04  1.840773e-04  3.645036e-04
8 -2.258187e-04  1.445371e-04 -1.077903e-04 -3.583290e-04  2.386756e-04  5.422018e-04
9  3.812360e-04 -3.628313e-04  3.051868e-04  8.276013e-05 -2.870674e-04  5.122258e-04
Parfait
  • Thanks @Parfait. Your solution is elegant. I tried it for a larger data set of 6000x100. I get `warnings()`: `In summary.lm(linearModel) : essentially perfect fit: summary may be unreliable`, what do you think about this? The real data set is 6000x400. – Chetan Arvind Patil Jul 04 '17 at 05:16
@ChetanArvindPatil You are supposed to hit the tick near Parfait's answer if it solved the question you asked. Your present problem is related to the data you have. – kangaroo_cliff Jul 04 '17 at 09:55
  • @ChetanArvindPatil - did you `subset` out the same column names as done after `expand.grid`? If not you are attempting to regress a variable on itself which is perfect collinearity and hence the warning. Another reason is some column pairings are not exactly the same but linearly related, maybe by some formula. Check back at data source and exclude as needed. – Parfait Jul 04 '17 at 12:45
  • @DiscoSuperfly - I was waiting whether someone else can better Parfait's answer. As Stack Overflow [suggests](https://stackoverflow.blog/2009/01/06/accept-your-own-answers/) waiting 48 hours before _accepting_ an answer. – Chetan Arvind Patil Jul 05 '17 at 15:09
  • @Parfait - Thanks. Seems to be the case, it's the data problem. Also, when I use the full data set of 6000x400, I run into memory issues: `Error: cannot allocate vector of size 442.0 Mb`. Based on [this answer](https://stackoverflow.com/questions/10917532/r-memory-allocation-error-cannot-allocate-vector-of-size-75-1-mb), it seems my i7 8GB system is not enough. Is this because of R's limitation on how it handles data? I never encountered such issue with MATLAB where the data set was much larger than what I am using currently. – Chetan Arvind Patil Jul 05 '17 at 15:45
    Are you using a 32-bit or 64-bit machine? The former is limited to 3GB regardless of availability. For latter, try increasing mem: `memory.limit(size=56000)`. Try closing other apps. And try using command line version Rscript over RStudio with code that outputs matrix to disk as .RData with `saveRDS()`. You can then use data in other later scripts with `readRDS()`. – Parfait Jul 05 '17 at 16:50
    @Parfait - I am on 64-bit machine, and using RStudio. Your suggestion of `memory.limit(size=56000)` worked out. Thanks. – Chetan Arvind Patil Jul 05 '17 at 19:44
@Parfait - Using `subset()`, how can I create combinations that exclude the same combination happening twice? For example, columns c1 and c2 will appear twice in the above code: first as `c1-c2` and then as `c2-c1`. I want each pair to appear only once, either as `c1-c2` or as `c2-c1`. Any suggestions? – Chetan Arvind Patil Oct 10 '17 at 21:54
  • In `subset()` change the criteria from `c1!=c2` to `c1 < c2` – Parfait Oct 10 '17 at 21:56
  • @Parfait - Thank you. And that will also remove same column stacked against itself as it was being avoided with `c1!=c2`? – Chetan Arvind Patil Oct 10 '17 at 21:58
  • Sure it will, `c1` will not ever be equal to `c2` with that logic. – Parfait Oct 10 '17 at 21:58
  • @Parfait - In `R` this works fine, when I switch to `TERR` I get `Error: out of memory` after I execute `mapply()` code above. `modelcols` size is just `74691` for my data set. Does `R` or `TERR` compute memory requirement before hand? `memory.limit()` is `16048` – Chetan Arvind Patil Oct 10 '17 at 22:01
  • What is `TERR`? You might need to ask a different question with reproducible example. – Parfait Oct 10 '17 at 22:02
  • @Parfait - Sure. This is [TERR](https://docs.tibco.com/products/tibco-enterprise-runtime-for-r). Same as `R` but with some optimization (not sure what exactly). – Chetan Arvind Patil Oct 10 '17 at 22:03
  • It might be an environment issue with that product as R code clearly works. Consult the vendors or your IT. – Parfait Oct 10 '17 at 22:04
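The `c1 < c2` refinement from the last comments can be sketched as follows, reusing the `Data` and `modelcols` names from the answer above:

```r
# Keep each unordered column pair exactly once: lexicographic ordering of
# the names drops both self-pairs (c1 == c2) and reversed duplicates.
modelcols <- subset(expand.grid(c1 = names(Data), c2 = names(Data),
                                stringsAsFactors = FALSE), c1 < c2)
nrow(modelcols)   # choose(4, 2) = 6 pairs for the 4-column example
```

This halves the number of models fitted compared with the `c1 != c2` version (6 instead of 12 for four columns), which also roughly halves the memory needed for the residual matrix.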