3

I have been able to create a table with the median mean and sd, based on several conditions, using the dplry package. To do so I prepped the data. It looks a bit sloppy to me and I am using a big dataset (>3gb) so I have been using data.table already.

I am wondering how I can create a similar table more efficiently using the data.table package, so, if it is possible to do this without prepping the data too much. That would be great as I need to do this for many other conditions as well. Btw, I checked out this post but it didn't add conditions.

This is what I have and need:

I need the mean of con by id and date= 1 year since first i or i2 = A04 or A01.

EDIT (because it wasn't running before):

DATA:

DT <- structure(list(id = c(123L, 123L, 332L, 332L, 332L, 100L, 100L, 113L, 113L, 113L, 113L, 551L, 551L), 
               i = c("D95", "F85", "A01", "A04", "K20", "B10", "A04", "T08", "P28", "D95", "A04", "B12", "D95"), 
               i2 = c("F15", "", "", "", "", "", "", "", "", "A01", "", "A01", ""), 
               date = c("2015-06-19", "2016-08-15", "2013-03-16", "2017-01-17", "2013-01-16", "2009-05-08", "2011-04-03", "2015-05-04",
                        "2011-04-04", "2017-08-04", "2011-05-24", "2013-11-04", "2013-05-04"), 
               con = c("1", "1", "1", "1", "1", "1", "1", "1", "1", "1",  "1", "1", "1"), 
               PS = c("0",  "0", "1", "0", "0", "0", "1", "0", "0", "0", "1", "1", "0")), 
          class = "data.frame", row.names = c(NA, -13L))

The following is what I have done so far (and which works):

I created the columns PS (= having either A01 or A04 in i or i2) and ds(=days since first A01 or A01 (days aren't correct here)) based on the data.

With the dplyr package I first made a data.table with only rows with only ds between -365&0.

j.ds <- subset(DT, ds >= -365 & ds < 0)

Than agregated to a table with sum of con per id, like so:

j <- j.ds %>%
  group_by(id) %>% 
  summarise(con = sum(con))

From there I made the desired table:

jP <- j %>%
  summarise(median = median(con), mean = mean(con), SD = sd(con))
Wilkit
  • 87
  • 5

2 Answers2

2

As far as I can tell you want the sum of con per id, for data where ds is between -365 and 0, then the mean, median and standard deviation of those sums. This can be done by chaining two expressions.

DT[ds >= -365 & ds < 0, sum(con), by = "id"][, .(mean(V1), median(V1), sd(V1))]

This first subsets your data, then sums con per id, then takes the mean etc. of the sums with a second expression

rg255
  • 4,119
  • 3
  • 22
  • 40
  • Thank you, but I get the following error `Error in gsum(con) : Type 'character' not supported by GForce sum (gsum). Either add the prefix base::sum(.) or turn off GForce optimization using options(datatable.optimize=1)` – Wilkit Apr 25 '20 at 12:32
  • 1
    You have con as a character, use `as.numeric(con)` or convert it in the data – rg255 Apr 25 '20 at 12:34
  • Okay, so no shortcuts there ;) It seems to work on the test set `DT`, thanks. Going to try it on my larger set. Do you also have solution to getting these results from a dataset that doesn't contain the columns `PS` `ds` and `con` (yet)? – Wilkit Apr 25 '20 at 12:38
  • 1
    Also, if con is always 1, you could use .N, rather than sum(con), though you would need to change the mean(V1) to be mean(N) etc. Also if this answers your question please accept/upvote :) – rg255 Apr 25 '20 at 12:39
  • As an example, using the iris dataset... `DT <- data.table(iris); DT[Petal.Length > 1 & Petal.Length < 5, .N, by = "Species"][, .(mean(N), sd(N))]` – rg255 Apr 25 '20 at 12:43
  • 1
    It all works! Changing to using `.N` works, although when I changed mean(V1) etc. to mean(N) it gave the overall frequency (I think). Back to mean(V1) gave me the desired result. Thank you so much and could you also upvote my question so I can finally get rights to comment in other posts instead of making StackOverflow more cluttered with questions related to existing posts? – Wilkit Apr 25 '20 at 12:52
1

Since I cannot run your example, I give you an untested data.table translation of your dplyr code.

You can do the operations you were doing quite efficiently in data.table:

library(data.table)
setDT(jx)
j <- jx[ds >= -365 & ds < 0, lapply(.SD, sum), by = "id"]
jP <- j[, lapply(.SD, function(x) return(c(median(x), mean(x), sd(x)))]
jP[,'stat' := c("median","mean","sd")

You end-up with a long formatted dataframe.

Edit

If you have non-numeric columns, you must not sum them. You can control the columns used in .SD with .SDcols. For instance, in your example, the columns are character: to do a sum you need to convert them to numeric. For instance, if you are interested in con and PS columns, you should do:

cols <- c("con","PS")
j <- DT[,lapply(.SD, function(x) sum(as.numeric(x))), by = "id", .SDcols = cols]

With function(x) sum(as.numeric(x)) you create an anonymous function (equivalent to lambda functions in python that applies as.numeric and sum all elements in your subset of data (SD).

The anonymous function return(c(median(x), mean(x), sd(x))) follows the same logic

linog
  • 5,786
  • 3
  • 14
  • 28
  • The 3rd line gives the following error: `Error in gsum(i) : Type 'character' not supported by GForce sum (gsum). Either add the prefix base::sum(.) or turn off GForce optimization using options(datatable.optimize=1)` And I've edited my dataset, sorry about that! Can you also explain what the x in function(x) means? I've seen it elsewhere but don't understand where it comes from. – Wilkit Apr 25 '20 at 12:27
  • 1
    You were summing over character variables. See my edit – linog Apr 25 '20 at 12:53
  • 1
    I also answer your question regarding the `function(x)` notation – linog Apr 25 '20 at 12:54
  • I've run: `cols <- c("con","PS", 'ds', 'id', 'i', 'i2') j <- DT[,lapply(.SD, function(x) sum(as.numeric(x))), by = "id", .SDcols = cols] j <- DT[ds >= -365 & ds < 0, lapply(.SD, sum), by = "id"]` And still get the error: `Error in sum(i) : invalid 'type' (character) of argument` Also tried after dropping the `i` and `i2` columns, but still similar error. Luckily rg255's sollution works. Thanks, also for the rest of the explanations, that clears things up for my beginner brain. – Wilkit Apr 25 '20 at 13:05
  • You're welcome. The error comes from the fact that you no longer need `j <- DT[ds >= -365 & ds < 0, lapply(.SD, sum), by = "id"]` since you already aggregated with `j <- DT[,lapply(.SD, function(x) sum(as.numeric(x))), by = "id", .SDcols = cols] `. – linog Apr 25 '20 at 13:22