Memory usage of R dataframes

Question

I have a data frame resulting from a query to an SQL server using RMySQL and applying dplyr to it afterwards.

After I use subset() on it, the resulting subsetted data frame takes the same space as the original.

The subset has ~10% of the rows of the original data frame. When I saved the subset to a CSV file and loaded it again, it had 10% of the original size, as I'd expected.

# include dplyr and RMySQL, setup connection... etc.

requests <- query("SELECT created_at FROM requests")

requests$created_at %>% 
  as.POSIXlt %>% 
  cut.POSIXt(breaks="sec") %>%
  table %>% as.data.frame -> df

colnames(df) <- c('created_at', 'requests')

dfss <- subset(df, requests > 3)

Now, the memory usage shows as:

                    Type      Size    Rows Columns
df            data.frame 455869312 5180320       2
dfss          data.frame 414427000      13       2

And after doing anything like dfss$requests <- 1, I still get:

                    Type      Size    Rows Columns
df            data.frame 455869312 5180320       2
dfss          data.frame 414427000      13       2

If I truncate the table by using head(df, 10000) and then do the whole thing again, I get similar behaviour, with the subset being just a little smaller than the original set, even though it has only a few rows:

                 Type     Size   Rows Columns
df         data.frame 20199008 229521       2
dfss       data.frame 18440576   9718       2

What is going on here?

Some good ideas here: http://stackoverflow.com/questions/1358003/tricks-to-manage-the-available-memory-in-an-r-session — Stedy, Nov 25 '15 at 15:48
Actually, the resulting subset takes (almost) zero memory. The second object is not a copy. The copy will only take place if you modify the first object. — Andrie, Nov 25 '15 at 16:01
Andrie, I tried dfss$x = 10 to change its size and it did not work. Any ideas? — Rodrigo Stv, Nov 25 '15 at 16:08
This is not reproducible. We don't have access to your data, nor do we know how you measure the object size. Please use a built-in dataset, public data or randomly generated data. Also show all your code. — Andrie, Nov 25 '15 at 17:00

score 2 · Answer 1 · answered Nov 25 '15 at 16:06

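You have not provided a reproducible example, but I'm pretty confident you're simply wrong. When you assign an existing object to a new name, R does not copy the data: both names point to the same memory address, and a real copy is made only when one of them is modified. A subset is a new object, but it only needs memory for the data it actually contains, which is a fraction of the original's size.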

See below:

library(pryr)
mem_used()

x <- matrix(1:10^6, 10^5)
mem_used()
y <- x
mem_used() # copy via pointer, no new memory allocation

y2 <- x[sample(1:nrow(x), 10^4, replace = FALSE), ]
mem_used() # only requires a fraction of the memory of x

y3 <- subset(x, sample(c(TRUE, FALSE), nrow(x), replace = TRUE, prob = c(.1, .9)))
mem_used() # only requires a fraction of the memory of x

object_size(x) 
object_size(y)
object_size(y2)
object_size(y3)
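
One case the matrix example above does not cover is a data frame with a factor column. The pipeline in the question (`cut()` followed by `table()` and `as.data.frame()`) produces a factor with one level per second, and `subset()` keeps every level even when only a few rows remain. Whether this accounts for the sizes reported in the question is an assumption, since the original data is not available. The sketch below uses randomly generated timestamps as a stand-in and base R's `object.size()` rather than pryr:

```r
# Randomly generated stand-in for the question's data (assumption: the real
# column holds one timestamp per request).
set.seed(1)
stamps <- as.POSIXct("2015-11-01", tz = "UTC") +
  sample(0:99999, 20000, replace = TRUE)

# Same shape as the question's pipeline: one row per second, with a count.
df <- as.data.frame(table(cut(stamps, breaks = "sec")))
colnames(df) <- c("created_at", "requests")

# Keep only the busy seconds.
dfss <- subset(df, requests > 2)

nrow(df)
nrow(dfss)                     # far fewer rows than df
nlevels(df$created_at)         # one level per second
nlevels(dfss$created_at)       # unchanged: subset() keeps unused levels

object.size(df)
object.size(dfss)              # still dominated by the level labels
object.size(droplevels(dfss))  # shrinks once unused levels are dropped
```

If this is what is happening, calling `droplevels()` on the subset, or converting the column to character or `POSIXct`, should bring its size in line with its row count.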


alexwhitworth


I'll be more specific, then (see question again pls) – Rodrigo Stv Nov 25 '15 at 16:10
@RodrigoStv your question is not [reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and I fail to see any new insight in your question. Ask a better question and get a more specific answer. – alexwhitworth Nov 25 '15 at 16:16
Had no idea about pryr, interesting! – Stedy Nov 25 '15 at 16:19
I tried to make it clearer, but it depends on a SQL connection to a database so I guess it has side effects and is not purely reproducible – Rodrigo Stv Nov 25 '15 at 16:20
Your question is still not reproducible. I'm assuming your un-shown function call is to Dirk's `lsos` as linked in the comments above, but it's unclear. – alexwhitworth Nov 25 '15 at 16:30
