4

I am working with data.table package. I have a data table which represents users actions on a website. Let's say that every user can visit a website, and perform multiple actions on it. My original data table is of actions (every row is an action) and I want to aggregate this information into a new data table, grouped by user visits (every visit has a unique ID). There are some fields which are shared by the actions of the same visit - for example - the user name, the user status, the visit number etc. At least one of the actions of each visit contains this info (not necessarily all of the actions). I want to retrieve, for each visit (= group of actions with the same visit ID), the value of this field, and set it to the visit in the visits new data table. For example, if I have the following original data table:

VisitID     ActionNum    UserName   UserStatus    VisitNum   ActionType
aaaaaaa        1           John        Active        5           x
aaaaaaa        2                       Active                    y
aaaaaaa        3           John                      5           z
bbbbbbb        1                      NonActive                  w
bbbbbbb        2           Dan                       7           t

I want to have a visits data table, as following:

VisitID  UserName   UserStatus   VisitNum
aaaaaaa   John       Active        5
bbbbbbb   Dan        NonActive     7

I created a function that works on subset of data table (only the rows of the visit) and a field, and this function should be applied on several fields (UserName, UserStatus, VisitNum).

getGeneralField<- function(visitDT,field){
  vec = visitDT[,get(field)]
  return (unique(vec[vec != ""])[1])
}

The problem is that every trial to apply this function on .SD when by=VisitID results in something different than I planned... What is the best way to do it? I used !="" in order to avoid blank cells.

user3017075
  • 351
  • 3
  • 16

1 Answers1

6

We specify the columns of interest in .SDcols, grouped by 'VisitID', loop through the columns in .SDcols (lapply(.SD, ...) and get the first non-blank element

dt[, lapply(.SD, function(x) x[nzchar(x)][1]), by = VisitID, .SDcols = 3:5]
akrun
  • 874,273
  • 37
  • 540
  • 662