
I have a data frame `df` with 15 columns and 1,000,000 rows, all ints. My code is:

for (i in 1:nrow(df))
{
  if (is.null(df$col1[i]) || .... || is.null(df$col9[i]))
    df[-i, ]  # to delete the row if one of those columns is null
}

This has been running for an hour and is still going. Why? It seems like it should be relatively fast code to run. How can I speed it up?

Raj Raina

  • Can you `dput` some of your data with a NULL value? Isn't it rather NA? – Colonel Beauvel Jul 22 '15 at 21:02
  • Try `df[colSums(is.null(df))==0,]` (not tested). You would also need to reassign `df` in your loop; otherwise, those rows are not deleted. Generally, `for` loops are not very fast in R, especially when applied to each row of a 1e6-by-15 table. – talat Jul 22 '15 at 21:02
  • Your code isn't working. Are you assigning df[-i,] to anything? Even if you do, think about what will happen when you delete row 1 and i becomes 2. – jeremycg Jul 22 '15 at 21:03
  • This seems like a poor way to filter rows. It would be better if you took a step back and described what you are trying to do. Include a minimal [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output. Start small. – MrFlick Jul 22 '15 at 21:13
  • ... of course I meant `df[rowSums(is.null(df))==0,]`, not `colSums`. But apparently, that doesn't work (tested on mtcars). – talat Jul 22 '15 at 21:17

1 Answer


The reason it is slow is that R is relatively slow at looping over elements one by one. Most functions in R are vectorized, meaning they operate on an entire vector at once, which is much faster than an explicit element-by-element loop. On a side note, I don't think you have NULLs in your data frame: an int column can only hold NA, not NULL, and `is.null()` on an element of an atomic vector always returns FALSE, so I'm going to assume you have NAs. Even if you did somehow have NULLs, the following should still work.
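For illustration (a minimal sketch, not from the original answer), compare a vectorized call with the equivalent element-by-element loop over a toy vector:

x <- c(1L, NA, 3L, NA, 5L)

# Vectorized: one call handles the whole vector
is.na(x)  # FALSE TRUE FALSE TRUE FALSE

# Element-by-element loop doing the same work; far slower at 1e6 rows
out <- logical(length(x))
for (i in seq_along(x)) {
  out[i] <- is.na(x[i])
}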

This syntax should give you a nice speed boost. It takes advantage of `rowSums` producing NA for every row that has a missing value in it (assuming col1 through col9 are the first nine columns):

df <- subset(df, !is.na(rowSums(df[, 1:9])))  # keep rows with no NA in col1..col9

This syntax should also work.

df <- df[rowSums(is.na(df[, 1:9])) == 0, ]  # drop rows with any NA in col1..col9
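Equivalently (an aside, again assuming the columns of interest are the first nine), base R's `complete.cases` expresses the same filter:

df <- df[complete.cases(df[, 1:9]), ]  # TRUE for rows with no missing values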
Dean MacGregor
  • @user3000877 if this explanation answers your question, would you consider accepting it? If not, could you let us know what extra clarification could be provided? – Dean MacGregor Jul 29 '15 at 21:34
  • The looping isn't the slow part here. What's slow is the constant copying of a whole data frame (minus one row), which is what I'm assuming is happening (although OP's code doesn't show this). – Konrad Rudolph Jul 29 '15 at 21:50
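A minimal sketch (not part of the original thread) of the pattern this comment describes: removing rows one at a time copies the remaining data frame on every iteration, roughly O(n^2) work in total, whereas the vectorized filter above copies it once.

# Hypothetical illustration of the slow pattern; each df[-i, ] allocates
# a full copy of the remaining rows
drop_na_rows_slowly <- function(df) {
  i <- 1
  while (i <= nrow(df)) {
    if (any(is.na(df[i, 1:9]))) {
      df <- df[-i, ]  # full copy of the data frame, minus one row
    } else {
      i <- i + 1
    }
  }
  df
}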