
I have a data frame `df` with 15 columns and 1,000,000 rows, all ints. My code is:

for (i in 1:nrow(df))
{
  if (is.null(df$col1[i]) || .... || is.null(df$col9[i]))
    df[-i, ]  # to delete the row if one of those columns is null
}

This has been running for an hour and is still going. Why? It seems like it should be relatively fast code to run. How can I speed it up?

Raj Raina

  • Can you `dput` some of your data with a NULL value? Isn't it rather NA? – Colonel Beauvel Jul 22 '15 at 21:02
  • Try `df[colSums(is.null(df))==0,]` (not tested). You would also need to reassign `df` in your loop; otherwise, those rows are not deleted. Generally, `for` loops are not very fast in R, especially when applied to each row of a 1e6-by-15 table. – talat Jul 22 '15 at 21:02
  • Your code isn't working. Are you assigning df[-i,] to anything? Even if you do, think about what will happen when you delete row 1 and i becomes 2. – jeremycg Jul 22 '15 at 21:03
  • This seems like a poor way to filter rows. It would be better if you took a step back and described what you are trying to do. Include a minimal [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output. Start small. – MrFlick Jul 22 '15 at 21:13
  • ... of course I meant `df[rowSums(is.null(df))==0,]`, not `colSums`. But apparently, that doesn't work (tested on mtcars). – talat Jul 22 '15 at 21:17

1 Answer


The reason it is slow is that R is relatively slow at looping over elements one by one. Most functions in R are vectorized, meaning they operate on an entire vector at once, which is much faster than an explicit element-by-element loop. On a side note, I don't think you have NULLs in your data frame: an int column can only hold NA, not NULL, and `is.null()` on an element of an atomic vector always returns FALSE, so I'm going to assume you have NAs. Even if you did somehow have NULLs, the following should still work.
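For illustration (a minimal sketch, not from the original answer), compare a vectorized call with the equivalent element-by-element loop over a toy vector:

x <- c(1L, NA, 3L, NA, 5L)

# Vectorized: one call handles the whole vector
is.na(x)  # FALSE TRUE FALSE TRUE FALSE

# Element-by-element loop doing the same work; far slower at 1e6 rows
out <- logical(length(x))
for (i in seq_along(x)) {
  out[i] <- is.na(x[i])
}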

This syntax should give you a nice speed boost. It takes advantage of `rowSums` producing NA for every row that has a missing value in it (assuming col1 through col9 are the first nine columns):

df <- subset(df, !is.na(rowSums(df[, 1:9])))  # keep rows with no NA in col1..col9

This syntax should also work.

df <- df[rowSums(is.na(df[, 1:9])) == 0, ]  # drop rows with any NA in col1..col9
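Equivalently (an aside, again assuming the columns of interest are the first nine), base R's `complete.cases` expresses the same filter:

df <- df[complete.cases(df[, 1:9]), ]  # TRUE for rows with no missing values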
Dean MacGregor
  • @user3000877 if this explanation answers your question, would you consider accepting it? If not, could you let us know what extra clarification could be provided? – Dean MacGregor Jul 29 '15 at 21:34
  • The looping isn't the slow part here. What's slow is the constant copying of a whole data frame (minus one row), which is what I'm assuming is happening (although OP's code doesn't show this). – Konrad Rudolph Jul 29 '15 at 21:50
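A minimal sketch (not part of the original thread) of the pattern this comment describes: removing rows one at a time copies the remaining data frame on every iteration, roughly O(n^2) work in total, whereas the vectorized filter above copies it once.

# Hypothetical illustration of the slow pattern; each df[-i, ] allocates
# a full copy of the remaining rows
drop_na_rows_slowly <- function(df) {
  i <- 1
  while (i <= nrow(df)) {
    if (any(is.na(df[i, 1:9]))) {
      df <- df[-i, ]  # full copy of the data frame, minus one row
    } else {
      i <- i + 1
    }
  }
  df
}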