I have a data frame resulting from a query to an SQL server using RMySQL and applying dplyr to it afterwards.

After I use subset() on it, the resulting subsetted data frame takes the same space as the original.

The subset has ~10% of the rows of the original data frame. When I saved it to a CSV file and loaded it again, it had 10% of the size, as I'd expected.

# load dplyr and RMySQL, set up the connection, etc.

requests <- query("SELECT created_at FROM requests")

# count the number of requests in each one-second interval
requests$created_at %>% 
  as.POSIXlt %>% 
  cut.POSIXt(breaks="sec") %>%
  table %>% as.data.frame -> df

colnames(df) <- c('created_at', 'requests')

# keep only the seconds with more than 3 requests
dfss <- subset(df, requests > 3)

Now, the memory usage shows as:

                    Type      Size    Rows Columns
df            data.frame 455869312 5180320       2
dfss          data.frame 414427000      13       2

And after doing anything like dfss$requests <- 1, I still get:

                    Type      Size    Rows Columns
df            data.frame 455869312 5180320       2
dfss          data.frame 414427000      13       2

If I truncate the table by using head(df, 10000) and then do the whole thing again, I get similar behaviour, with the subset being just a little smaller than the original set, even though it has only a few rows:

                 Type     Size   Rows Columns
df         data.frame 20199008 229521       2
dfss       data.frame 18440576   9718       2

What is going on here?

Rodrigo Stv
  • Some good ideas here: http://stackoverflow.com/questions/1358003/tricks-to-manage-the-available-memory-in-an-r-session – Stedy Nov 25 '15 at 15:48
  • Actually, the resulting subset takes (almost) zero memory. The second object is not a copy. The copy will only take place if you modify the first object. – Andrie Nov 25 '15 at 16:01
  • Yes, see below answer – alexwhitworth Nov 25 '15 at 16:07
  • Andrie, I tried dfss$x = 10 to change its size and it did not work. Any ideas? – Rodrigo Stv Nov 25 '15 at 16:08
  • Show your code and also how you compute the size – Andrie Nov 25 '15 at 16:09
  • This is not reproducible. We don't have access to your data, nor do we know how you measure the object size. Please use a built-in dataset, public data or randomly generated data. Also show all your code. – Andrie Nov 25 '15 at 17:00

1 Answer

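You have not provided a reproducible example, but I'm pretty confident you're simply wrong. A plain assignment in R does not duplicate the data: both names point to the same memory address, and a real copy is made only when one of them is modified. A subset is a new object, and it needs only a fraction of the memory of the original, in proportion to the data it keeps.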

See below:

library(pryr)
mem_used()

x <- matrix(1:10^6, 10^5)   # 10^5 rows by 10 columns of integers
mem_used()
y <- x
mem_used() # copy via pointer, no new memory allocation

y2 <- x[sample(1:nrow(x), 10^4, replace = FALSE), ]   # 10% of the rows
mem_used() # only requires a fraction of the memory of x

y3 <- subset(x, sample(c(TRUE, FALSE), nrow(x), replace = TRUE, prob = c(.1, .9)))
mem_used() # only requires a fraction of the memory of x

object_size(x)
object_size(y)
object_size(y2)
object_size(y3)
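
One case where a subset can stay almost as large as the original is a factor column: subsetting a factor keeps every level, used or not, and the levels are stored with the object. `as.data.frame()` applied to a `table` returns the table's dimension as a factor, so `created_at` in the question is probably a factor with one level per second. That is an assumption, since the data is not available. It would fit the numbers, though: the gap between the two reported sizes is about 41 MB, roughly 8 bytes per row of `df`, which is what two integer vectors would account for. The sketch below uses generated labels as a hypothetical stand-in for the query result and shows `droplevels()` shrinking the subset.

```r
# Sketch under an assumption: created_at is a factor with one level per row,
# as as.data.frame() on a table would produce. The data is generated here;
# it is a hypothetical stand-in for the SQL result in the question.
set.seed(1)
n <- 1e5

df <- data.frame(
  created_at = factor(sprintf("t%06d", seq_len(n))),  # n distinct levels
  requests   = sample(1:5, n, replace = TRUE)
)

dfss <- subset(df, requests > 4)   # keeps only the rows with requests == 5

nlevels(df$created_at)             # n levels
nlevels(dfss$created_at)           # still n levels: unused levels are kept

dfss_small <- droplevels(dfss)     # discard the levels that no longer occur
nlevels(dfss_small$created_at)     # one level per remaining row

object.size(df)
object.size(dfss)                  # dominated by the retained levels
object.size(dfss_small)            # now scales with the number of rows
```

If the real `dfss` behaves the same way, `dfss <- droplevels(dfss)` should bring its size down; converting `created_at` with `as.character()` before subsetting is another option.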
alexwhitworth
  • I'll be more specific, then (see question again pls) – Rodrigo Stv Nov 25 '15 at 16:10
  • @RodrigoStv your question is not [reproducible](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and I fail to see any new insight in your question. Ask a better question and get a more specific answer. – alexwhitworth Nov 25 '15 at 16:16
  • Had no idea about pryr, interesting! – Stedy Nov 25 '15 at 16:19
  • I tried to make it clearer, but it depends on a SQL connection to a database so I guess it has side effects and is not purely reproducible – Rodrigo Stv Nov 25 '15 at 16:20
  • Your question is still not reproducible. I'm assuming your un-shown function call is to Dirk's `lsos` as linked in the comments above, but it's unclear. – alexwhitworth Nov 25 '15 at 16:30