
I just want to calculate the maximum value of each column separately. A simple sapply call runs into memory problems:

 # dt is my data.table object
 res <- sapply(dt, max, na.rm=T) # fails due to memory problems

It is a sparse table of 1 million rows and 1000 columns, with an overall size of 11 GB.

I am working on the file train_date.csv and use the following lines of code:

require(data.table)
# read the csv file into a data.table
dtDate <- fread(paste0(filePath, "train_date.csv"))
dim(dtDate)
# inspect the in-memory size of the object
require(pryr)
object_size(dtDate)
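
As a quick sanity check after `fread()` (an addition of mine, not part of the original code), one can confirm that the columns were parsed as numeric, since character columns would change what `max()` returns later:

# Hedged sketch: tabulate the column classes of the freshly read table.
table(vapply(dtDate, function(x) class(x)[1], character(1)))
# Note: extract a single column with [[ to get a plain vector;
# dtDate[, 3] returns a one-column data.table, not a vector.
is.numeric(dtDate[[3]])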
CodingButStillAlive
  • What exactly is the code you've used? What is the data size? What are your specs? Did you read [this](https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-intro.html)? There are examples of the *idiom* for `lapply` usage and much more. – David Arenburg Dec 08 '16 at 12:30
  • 1
  • Maybe it would be better to use `apply(df, 2, max)`. – Istrel Dec 08 '16 at 12:31
  • It is a sparse table of 1 million rows and 1000 columns, with an overall size of 11 GB @DavidArenburg. – CodingButStillAlive Dec 08 '16 at 12:38
  • This is not how you run `data.table` code. You should read the intro I've linked above – David Arenburg Dec 08 '16 at 12:53
  • The answer of mpjdem follows this recommendation from the data.table FAQ page: https://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.html#how-can-i-avoid-writing-a-really-long-j-expression-youve-said-that-i-should-use-the-column-names-but-ive-got-a-lot-of-columns. – CodingButStillAlive Dec 08 '16 at 13:15
  • What I really don't like about packages like sparklyr, data.table, and so on is that they assume analysis scenarios where you can refer to each of your columns by name. This is seldom the case for high-dimensional data. I really wonder what data analysis scenarios the developers had in mind. Most of the packages fail on simple things like applying a function to each column if you have 1000 columns. This is the same problem as with sparklyr. – CodingButStillAlive Dec 08 '16 at 13:19
  • Re your comment, I hear that finance and genetics have large data sets where ordering and groups (which is what many data.table features are built around) matter. And I use it with smallish data sets for the nice syntax. If you are using a sparse matrix, you should google a package designed for such a thing rather than shoehorning it into spark or data.table, I guess. Also, you could google column maxes in R to find `colMaxs` (see the sketch after these comments). – Frank Dec 08 '16 at 14:14
  • @Felix: thanks! Your assumption is right. I am originally a bioinformatician, but this file contains sensor data from the Industry 4.0 field. Thanks for the tips! – CodingButStillAlive Dec 08 '16 at 14:55
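
A minimal sketch of the `apply`/`colMaxs` route suggested in the comments above (my own addition, using `matrixStats::colMaxs`; it is not code from the thread). Both approaches first convert the data.table to a matrix, and `as.matrix()` makes a full copy, which may be prohibitive for an 11 GB table:

# Hedged sketch: column maxima via matrixStats::colMaxs on a matrix copy.
# Assumes all columns are numeric and that RAM allows a full copy of the table.
library(matrixStats)
m <- as.matrix(dtDate)
res <- colMaxs(m, na.rm = TRUE)
names(res) <- colnames(dtDate)
head(res)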

1 Answer


Warning: the example below creates a very large table!

# example data: a 1,000,000 x 1000 data.table of random numbers
dt <- as.data.table(matrix(runif(1000 * 1000000), ncol = 1000))
# per-column maxima, computed column-wise inside j via .SD
dt[, lapply(.SD, max)]
mpjdem
  • Thanks, but this doesn't work. It resulted in a data.table with data.frames included. – CodingButStillAlive Dec 08 '16 at 12:42
  • Then something is wrong with your setup or your data. It should work. I included an example data.table above that works perfectly for me. – mpjdem Dec 08 '16 at 12:52
  • Thank you very much. I will go and check my code to see what is messing it up. – CodingButStillAlive Dec 08 '16 at 13:06
  • How does this answer the memory overflow question? – David Arenburg Dec 08 '16 at 13:11
  • For some strange reason, the resulting data.table res in 'res <- dt[, {lapply(.SD,max, na.rm=T)}]' includes all column names instead of the maximum values. – CodingButStillAlive Dec 08 '16 at 13:12
  • @DavidArenburg It answered the question in the title. If a memory problem still exists afterwards, we'll see about it then (should be easy to circumvent in the case of max; see the sketch after these comments). @CodingButStillAlive: Did you parse the header line of the csv file as a row, or something like that? Are your columns numeric? – mpjdem Dec 08 '16 at 13:22
  • @mpjdem: I used the function fread with no further parameters. – CodingButStillAlive Dec 08 '16 at 14:48
  • I added a link to the concrete csv file in my question, in case someone wants to check it. However, there is something strange anyway, as the 2.9 GB file leaves a memory footprint of 11 GB inside R. See my related question: https://github.com/Rdatatable/data.table/issues/1959 – CodingButStillAlive Dec 08 '16 at 15:06
  • @CodingButStillAlive Please just run `is.numeric` on one of the columns and if it is `FALSE`, check whether `colname %in% col` is `TRUE`, substituting colname for your column name and col for the actual column. I'm asking because if your columns are numbers converted to characters and the column name is also somewhere among its values, then it would indeed show up as the maximum value of that column. – mpjdem Dec 08 '16 at 15:06
  • I did: `> is.numeric(dtDate[,3]) [1] FALSE > colnames(dtDate)[3] %in% dtDate[,3] [1] FALSE` – CodingButStillAlive Dec 08 '16 at 15:13
  • I tried using the actual file. My code does work; however the console output will list the column names first and since there are so many, you won't see the maximal values anymore. But they are there, if you extract one column at a time – mpjdem Dec 08 '16 at 15:29
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/130130/discussion-between-codingbutstillalive-and-mpjdem). – CodingButStillAlive Dec 08 '16 at 15:35
  • Thanks, you are right. I dunno why I had not seen this. I am very sorry for the efforts. – CodingButStillAlive Dec 08 '16 at 15:37
  • I had just checked with `res[1]` instead of `res[,1]` cause I was expecting a vector or list as the result. Stupid mistake :( – CodingButStillAlive Dec 08 '16 at 15:43
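
Regarding the memory concern raised in the comments: a minimal sketch (my own addition, not from the answer) that computes the column maxima in blocks of columns via `.SDcols`, so only a small intermediate result exists at any time; the block size of 100 is an arbitrary choice:

# Hedged sketch: per-column maxima computed in blocks of columns to keep
# intermediate objects small. Assumes dt is a data.table as in the answer.
library(data.table)
block_size <- 100
cols <- names(dt)
res <- numeric(0)
for (i in seq(1, length(cols), by = block_size)) {
  block <- cols[i:min(i + block_size - 1, length(cols))]
  part  <- dt[, lapply(.SD, max, na.rm = TRUE), .SDcols = block]
  res   <- c(res, unlist(part))
}
head(res)

Note that `dt[, lapply(.SD, max)]` returns a one-row data.table rather than a vector, which is what the `res[1]` vs `res[, 1]` confusion above was about; `unlist()` turns it into a named numeric vector.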