I have a data frame that results from a query to a MySQL server using RMySQL, with some dplyr processing applied afterwards. After I call subset() on it, the subsetted data frame takes up about the same amount of memory as the original, even though the subset has only ~10% of the rows of the original data frame. When I save the subset to a CSV file and load it back in, the reloaded copy is small, roughly in proportion to its row count, as I'd expected (the round trip is sketched further below).
# load dplyr and RMySQL, set up the connection `con`, etc.
library(dplyr)
library(RMySQL)

requests <- dbGetQuery(con, "SELECT created_at FROM requests")

# count requests per second and turn the counts into a data frame
requests$created_at %>%
  as.POSIXlt %>%
  cut.POSIXt(breaks = "sec") %>%
  table %>%
  as.data.frame -> df
colnames(df) <- c('created_at', 'requests')

dfss <- subset(df, requests > 3)
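The size listings below come from a small helper along these lines (a sketch; mem_table and its exact layout are illustrative, the real listing is just object.size() plus each object's class and dimensions):

# Illustrative helper: report class, in-memory size in bytes (object.size),
# and dimensions for objects named as strings.
mem_table <- function(obj_names, env = parent.frame()) {
  rows <- lapply(obj_names, function(nm) {
    x <- get(nm, envir = env)
    data.frame(Type    = class(x)[1],
               Size    = as.numeric(object.size(x)),
               Rows    = NROW(x),
               Columns = NCOL(x),
               row.names = nm)
  })
  do.call(rbind, rows)
}

mem_table(c("df", "dfss"))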
Now, the memory usage shows as:
      Type        Size        Rows     Columns
df    data.frame  455869312   5180320  2
dfss  data.frame  414427000   13       2
And even after modifying the subset with something like dfss$requests <- 1, I still get:

      Type        Size        Rows     Columns
df    data.frame  455869312   5180320  2
dfss  data.frame  414427000   13       2
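For reference, the CSV round trip I mentioned at the top looked roughly like this (a sketch; the file name is just illustrative):

# write the subset out and read it back in
write.csv(dfss, "dfss.csv", row.names = FALSE)
dfss_from_csv <- read.csv("dfss.csv")

# the reloaded copy's in-memory size is small, in line with its row count,
# unlike dfss itself
format(object.size(dfss), units = "MB")
format(object.size(dfss_from_csv), units = "MB")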
If I truncate the query result with head(requests, 10000) and run the whole pipeline again, I get similar behaviour, with the subset being only a little smaller than the original even though it has a fraction of the rows (a sketch of this rerun follows the listing):
      Type        Size       Rows    Columns
df    data.frame  20199008   229521  2
dfss  data.frame  18440576   9718    2
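The truncated rerun is just the same pipeline applied to the first 10,000 rows of the query result (a sketch; my exact cut-off may have differed):

# same pipeline, but on a truncated copy of the raw timestamps
head(requests$created_at, 10000) %>%
  as.POSIXlt %>%
  cut.POSIXt(breaks = "sec") %>%
  table %>%
  as.data.frame -> df
colnames(df) <- c('created_at', 'requests')
dfss <- subset(df, requests > 3)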
What is going on here?