
I am running the function below on the data frame combo4:

elecOrGas<-function(myData)
{
  for (i in 1:(nrow(myData)-1))
  {
    if (myData[i,2]==myData[i+1,2]) 

    {
      if ((myData$typeGas[i]==myData$typeElec[i+1])|(myData$typeElec[i]==myData$typeGas[i+1]))
      {
        myData$typeTest[i]=1
      } else { myData$typeTest[i]=0}
    } else { myData$typeTest[i]=0}
  } 
  return(myData)
} 

The combo4 data frame has 4 columns in the format below, with ~800K rows:

 CUSTID typeGas typeElec typeTest
12456   1        0         0
12563   1        0         1
12563   0        1         0
12455   0        1         0  

When I run elecOrGas(combo4), it takes forever to finish. I think I am doing something wrong here. Please assist.

aseem bhartiya
  • can you describe what your loop is trying to do? – agenis Mar 10 '16 at 18:42
  • Shouldn't customer ID `12455` get `1` for `typeTest` since it is a repeat? – Pierre L Mar 10 '16 at 18:43
  • What are your data dimensions? Add some debugging statements (every 10 or 100 rows or so, try `message`) to see if it's calculating. – Roman Luštrik Mar 10 '16 at 19:24
  • I am trying to see if a customer is both gas and electric type. I have sorted by CUSTID, then I check whether two consecutive CUSTIDs are the same. If they are, I check whether the customer has both Elec and Gas service. If he/she has both, I assign typeTest as 1. – aseem bhartiya Mar 10 '16 at 19:32

1 Answer


Here's a solution using dplyr, which is great for handling this kind of problem. I created some simulated data matching your example:

library(dplyr)

## fake test data set
combo.test <- data.frame(
    CUSTID = sample(rep(10000:999999, each=2), 800000, replace = FALSE),
    typeGas = sample(c(0,1), 800000, replace = TRUE)
)
combo.test$typeElec <- ifelse(combo.test$typeGas == 0, 1, 0) 

To assign 1 to typeTest when a customer has a 1 for both typeElec and typeGas, possibly in different rows, use the dplyr "group_by" function to process each distinct CUSTID in your data.frame, then "mutate" to create a new variable "typeTest". "ifelse" tests whether "any" value is 1 in the typeGas column and "any" value is 1 in the typeElec column for that CUSTID.

# convert to tbl_df object, arrange by CUSTID, assign 1 to variable typeTest
#   if CUSTID has values for 1 in both typeGas and typeElec
ptm <- proc.time() 
combo.test <- combo.test %>% tbl_df() %>% arrange(CUSTID) %>% 
    group_by(CUSTID) %>% 
    mutate(typeTest = ifelse(any(typeGas == 1) & any(typeElec == 1), 1, 0)) %>%
    ungroup()
proc.time() - ptm

"tbl_df()" converts the data.frame to a dplyr tibble (in current dplyr releases "tbl_df()" is deprecated in favor of "as_tibble()"), and each pipe operator "%>%" passes the output of one function to the next. The code took about 10 sec to run for me.
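To make the grouping logic concrete, here is a minimal sketch that applies the same "group_by" and "mutate" steps to the four sample rows from the question. It assumes dplyr is installed; the "result" name is only for illustration.

```r
library(dplyr)

# The four sample rows shown in the question (typeTest left out, it is recomputed)
combo4 <- data.frame(
    CUSTID   = c(12456, 12563, 12563, 12455),
    typeGas  = c(1, 1, 0, 0),
    typeElec = c(0, 0, 1, 1)
)

# Flag every row of a customer who has gas in some row and electric in some row
result <- combo4 %>%
    group_by(CUSTID) %>%
    mutate(typeTest = ifelse(any(typeGas == 1) & any(typeElec == 1), 1, 0)) %>%
    ungroup()

print(result)
```

Note that this flags both rows of customer 12563, whereas the table in the question flags only the first of the two rows. If only the first row should be marked, the grouped approach would need an extra step.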

https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

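UPDATE: right, I should have answered your original question instead of only giving an alternative method. I found one bug in your function: the first "if" compares column 2 (typeGas) between consecutive rows, when it should compare column 1 (CUSTID). The speed problem comes from how R handles element-wise assignment into a data.frame compared with a plain vector: writing to myData$typeTest[i] inside the loop is much slower than filling a preallocated vector and attaching it to the data.frame once at the end. Here's a good discussion: (Speed up the loop operation in R).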

elecOrGas2 <- function(myData) {
    res <- numeric(nrow(myData))  # preallocate a vector for 'typeTest' (all zeros)

    for (i in 1:(nrow(myData)-1)) {
        #if (myData[i,2]==myData[i+1,2])
        if (myData[i,1]==myData[i+1,1]) {  # correct index for CUSTID
            if ((myData$typeGas[i]==myData$typeElec[i+1])|
                    (myData$typeElec[i]==myData$typeGas[i+1])) {
                res[i] <- 1  # write to the vector instead of the data.frame
                #myData$typeTest[i]=1
            } else {
                res[i] <- 0
            }
        } else {
            res[i] <- 0
        }
    }
    myData$typeTest <- res  # attach the result to the data.frame once
    return(myData)
}

library(dplyr)
combo.test <- data.frame(
    CUSTID = sample(rep(10000:999999, each=2), 800000, replace = FALSE),
    typeGas = sample(c(0,1), 800000, replace = TRUE)
)
combo.test$typeElec <- ifelse(combo.test$typeGas == 0, 1, 0)     
combo.test <- arrange(combo.test, CUSTID) %>% tbl_df()

# test time using 1/10 of the data
# original function: 29 sec
system.time(elecOrGas(combo.test[1:80000,]) -> test1)  
# updated function (preallocated result vector): 6 sec
system.time(elecOrGas2(combo.test[1:80000,]) -> test2)
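The loop can also be removed entirely by comparing each row with the next one using shifted index vectors. The following is a sketch in base R, assuming the data are already sorted by CUSTID and contain no missing values; "elecOrGasVec" is a hypothetical name, not part of the functions above, and I have not timed it on the full data set.

```r
# Hypothetical loop-free variant of the adjacent-row comparison
elecOrGasVec <- function(myData) {
    n <- nrow(myData)
    res <- numeric(n)                 # last row stays 0, as in the loop version
    if (n > 1) {
        i   <- seq_len(n - 1)         # rows 1 .. n-1
        nxt <- i + 1                  # their successors, rows 2 .. n
        sameCust <- myData$CUSTID[i] == myData$CUSTID[nxt]
        cross <- (myData$typeGas[i] == myData$typeElec[nxt]) |
                 (myData$typeElec[i] == myData$typeGas[nxt])
        res[i] <- as.numeric(sameCust & cross)
    }
    myData$typeTest <- res
    myData
}

# The four sample rows from the question
combo4 <- data.frame(
    CUSTID   = c(12456, 12563, 12563, 12455),
    typeGas  = c(1, 1, 0, 0),
    typeElec = c(0, 0, 1, 1)
)
out <- elecOrGasVec(combo4)
print(out)
```

On these four rows the typeTest column matches the table in the question: only the first of the two 12563 rows is flagged. Because every comparison runs on whole vectors, the cost no longer grows with a per-row interpreter loop.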
Lorenz D