1

Background:

Find the number of times companies move in tandem regarding their Market Capitalization, so for example with company A and company B, they move 3 times together and I would like to divide it when both company A and B show values different than NA (here, 10). I used a logical formula, TRUE when they have the same letter, FALSE when they do not have the same letter and NA when there is one NA value in A or B.

THE PROBLEM IS:

The code I used works with small sets, max 50 companies, then it takes too much time, and I am looking to do it for sets of 100 companies, approxi. 324.000.000 data points

Input (small subset): Dataframe "dat"

           CompA CompB CompC CompD 
    1         A    F <NA>    A
    2         A    F <NA>    F
    3         F    E <NA>    A
    4         A    A <NA>    A
    5         F    A <NA>    F
    6         A    D <NA>    D
    7         F    F <NA>    B
    8         A    A <NA>    F
    9         F    E <NA>    F
    10       <NA>  C <NA>    A
    11        E    F <NA>    E

Code used:

  v <- NULL
  i <- 1
  j <- 1

  for(i in 1:length(dat)-1){

  j <- i+1
    while(j <= length(dat)-1){
  str(dat)

       qone <- data.frame(qone = 
         (as.character(dat[,i+1])==as.character(dat[,j+1])))

     count1 <- length(which(qone == TRUE))/(length(which(qone == 
  TRUE))+length(which(qone == FALSE)))

    v <- append(v, count1)
   v <- data.frame(v)

   j <- j+1
   }}

Final output:

        x1     x2     x3    x4     x5    x6
    1   0.3    NA     0.5   NA    0.27   NA

Second Final output: 1 Nb TRUE 2 Nb FALSE

         x1     x2     x3    x4     x5    x6
    1     3    0       5     0      3     0
    2     7    0       6     0      8     0
Robin_Hcp
  • 309
  • 2
  • 15
  • 3
    You're growing object within loops which causes the problem. Take a look at these links to learn more: http://r-statistics.co/Strategies-To-Improve-And-Speedup-R-Code.html & http://www.dartistics.com/fast-r-code.html – Tung Apr 11 '18 at 06:07
  • Thank you, I will take a look at it @Tung – Robin_Hcp Apr 11 '18 at 06:08
  • 2
    Quite a few things to speed up here; Tung's comment is excellent, but also see: https://stackoverflow.com/questions/29402528/append-data-frames-together-in-a-for-loop – Marcus Campbell Apr 11 '18 at 06:08
  • See also: http://winvector.github.io/Accumulation/ – Tung Apr 11 '18 at 06:09

1 Answers1

2

Using loops in R is generally inefficient. Since you are growing the data frame inside the nested loop, it's going to slow down things heavily

Try the following approach :

library(data.table)
#Create the dummy data

companyData <- fread("~/test_data.csv",sep = "\t",na.strings = "<NA>")

#Two apply function to cross-over other columns over each column
v <- lapply(companyData, function(leftcomp) {
  lapply(companyData, function(rightcomp) {
    mean(leftcomp == rightcomp, na.rm = T)
  })})

#Unlist data to get n*n vector which has all the values
results <- unlist(v)

#Some logic to collect the required elements only.
l <- length(companyData)
a <- 1:(l*l)
b <- rep(seq(1,l*l,by = l+1),times = rep(l,times = l))

log_vec <- a > b
# 
# > log_vec
# [1] FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

final_result <- results[log_vec]

# > final_result
# CompA.CompB CompA.CompC CompA.CompD CompB.CompC CompB.CompD CompC.CompD 
# 0.3000000         NaN   0.5000000         NaN   0.2727273         NaN 

I tried to create a dummy data frame with ~200,000 rows and ~80 columns and run the above code, took nearly minute on an average laptop.

More information on the function used :

The first lapply takes each column of companyData and passes to second lapply. This column is compared with each column of companyData again.

mean(leftcomp == rightcomp, na.rm = T) leftcomp == rightcomp gives a logical vector with TRUE wherever the companies are in tandem. It returns and NA when any of the columns are NA.

Now, mean of this logical vector after ignoring the NAs would give us the ratio. This works because TRUE is essentially 1 and FALSE is 0 in R.

Eg:

>log_vec
[1] TRUE FALSE FALSE FALSE NA TRUE 

mean(log_vec,na.rm = T) returns 2/5 = 0.4.

Now, you wanted to select the elements which were unique (CompA vs CompB but not CompB vs CompA).

For that,

If l is the number of companies you're looking at (say, 4), a <- 1:(l*l) creates an index of all the computations lapply did above.

> a
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16

b is a vector as below :

> b
 [1]  1  1  1  1  6  6  6  6 11 11 11 11 16 16 16 16

log_vec <- a > b

> log_vec
 [1] FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

This vector has TRUE for the elements you need to pick.

Hope, it's more clear now.

Dave
  • 66
  • 3
  • Also, shouldn't the value of `x5` in your sample outcome be 0.2727? – Dave Apr 11 '18 at 16:21
  • Thanks a lot ! @Dave ! Yes, I got the wrong data in the question sorry.. I get this error: Error in Ops.factor(leftcomp, rightcomp) : level sets of factors are different – Robin_Hcp Apr 12 '18 at 04:47
  • And I have one more question if you don't mind.. If I wanted to count the number of TRUE and FALSE for each company to see if they are statistically significant, I edited the question with the output I would be looking for @Dave – Robin_Hcp Apr 12 '18 at 06:01
  • 1
    I will try to add more detail on the answer today evening. Can you elaborate more on the statistical significance question? What's your hypothesis? Sorry, but I don't see any edits in the question or output. – Dave Apr 12 '18 at 06:06
  • Sorry it did not update.. By statistical significance I mean just counting the number of TRUE and FALSE for each company pair, for example for company A & B we have 3 TRUE and 7 FALSE – Robin_Hcp Apr 12 '18 at 06:17