3

I would like to explore the best way to melt a data.table with na.rm applying only for the first element of the list of measure.vars.

I have a data.table as follows:

library(data.table)
library(lubridate)

dt.master <- data.table(user = seq(1,5),
                    visit_id = c(2,4,NA,4,8),
                    visit_date = c(dmy("10/02/2018"), dmy("11/04/2018"), NA, dmy("02/03/2018"), NA),
                    offer_id = c(1,3,NA,NA,NA),
                    offer_date = c(dmy("15/02/2018"), dmy("18/04/2018"), NA, NA, NA))

With dt.master:

   user visit_id visit_date offer_id offer_date
1:    1        2 2018-02-10        1 2018-02-15
2:    2        4 2018-04-11        3 2018-04-18
3:    3       NA       <NA>       NA       <NA>
4:    4        4 2018-03-02       NA       <NA>
5:    5        8       <NA>       NA       <NA>

I want to get, for each user, the "story" of commercial activity (that is: their visits and their offers).

dt.melted <- melt(dt.master,
                  id.vars = "user",
                  measure.vars = list(c("visit_id", "offer_id"), c("visit_date", "offer_date")),
                  variable.name = "level",
                  value.name = c("level_id", "level_date"))

With dt.melted:

    user level level_id level_date
 1:    1     1        2 2018-02-10
 2:    2     1        4 2018-04-11
 3:    3     1       NA       <NA>
 4:    4     1        4 2018-03-02
 5:    5     1        8       <NA>
 6:    1     2        1 2018-02-15
 7:    2     2        3 2018-04-18
 8:    3     2       NA       <NA>
 9:    4     2       NA       <NA>
10:    5     2       NA       <NA>

However, I don't want NAs to appear in the level_id column, i.e:

   user level level_id level_date
1:    1     1        2 2018-02-10
2:    2     1        4 2018-04-11
3:    4     1        4 2018-03-02
4:    5     1        8       <NA>
5:    1     2        1 2018-02-15
6:    2     2        3 2018-04-18

Unfortunately, the data quality of the sample is really bad, so level_date is not always available. Thus, a na.rm = T is not valid, as I would get:

dt.melted.na <- melt(dt.master,
                     id.vars = "user",
                     measure.vars = list(c("visit_id", "offer_id"), c("visit_date", "offer_date")),
                     variable.name = "level",
                     value.name = c("level_id", "level_date"),
                     na.rm = TRUE)

With dt.melted.na:

   user level level_id level_date
1:    1     1        2 2018-02-10
2:    2     1        4 2018-04-11
3:    4     1        4 2018-03-02
4:    1     2        1 2018-02-15
5:    2     2        3 2018-04-18

Is there a way to use na.rm = TRUE only for the first element of the list in measure.vars? I am currently exploring other workarounds (like filling visit_date and offer_date with "false" dates when visit_id and offer_id are available), but I would like to know if there is an elegant solution.

Uwe
  • 41,420
  • 11
  • 90
  • 134
illanxr
  • 65
  • 5
  • 5
    Is there a reason you can't just filter `dt.melted` to remove rows with `NA`? `dt.melted <- dt.melted[!is.na(dt.melted$level_id),]` or `dt.melted %>% filter(!is.na(level_id))` – divibisan May 09 '18 at 13:05
  • Or, similarly, you could filter the table `dt.master` during the melt by just tossing a `[!is.na(visit_id)]` on it. – C-x C-c May 09 '18 at 13:07
  • Thanks! regarding filtering `dt.melted` to remove rows with `NA`, I am currently working with a list of 5 elements for `measure.vars` (not just ID and date, but other characteristics), a number of elements for each list of 7, and a `dt.master` of ~2,000,000 rows. If I don't use `na.rm = T`, I run out of memory, so I am exploring other options – illanxr May 09 '18 at 13:19
  • 1
    This is kind of hacky, but what if you replaced every `NA` in `dt.master$offer_date` and `dt.master$level_date` with the string `` (or something else that is technically not an `NA`). Then after you run `melt.data.table` with `na.rm = T`, you can replace them with real `NA`s – divibisan May 09 '18 at 13:32
  • Thanks!! That is what I am currently doing. I was just looking if there is an elegant solution but, in the meanwhile, I am following that same workaround :) – illanxr May 09 '18 at 13:35

1 Answers1

1

An elegant solution would be if the na.rm parameter to melt() would take a vector of boolean values, one for each element in the list of measure.vars, e.g.,

melt(dt.master,
     id.vars = "user",
     measure.vars = list(c("visit_id", "offer_id"), c("visit_date", "offer_date")),
     variable.name = "level",
     value.name = c("level_id", "level_date"),
     na.rm = c(TRUE, FALSE))   # not possible with data.table v1.11.0

As this feature has not been implemented yet, an alternative approach would be to add the missing rows after reshaping to long form with na.rm = TRUE. The OP has pointed out that na.rm = TRUE has to be used due the problem size and memory limitations.

rbind(
  dt.melted.na,
  dt.master[!is.na(visit_id) & is.na(visit_date), .(user, level = 1L, level_id = visit_id)],
  dt.master[!is.na(offer_id) & is.na(offer_date), .(user, level = 2L, level_id = offer_id)],
  fill = TRUE
)
   user level level_id level_date
1:    1     1        2 2018-02-10
2:    2     1        4 2018-04-11
3:    4     1        4 2018-03-02
4:    1     2        1 2018-02-15
5:    2     2        3 2018-04-18
6:    5     1        8       <NA>

This approach is rather hacky and verbose but might help to overcome the memory limitations. It is essentially a reshaping "by hand" for the missing rows.

There is another alternative which might be less verbose:

incomplete_rows <- 
  melt(dt.master[!is.na(visit_id) & is.na(visit_date) | !is.na(offer_id) & is.na(offer_date)],
       id.vars = "user",
       measure.vars = list(c("visit_id", "offer_id"), c("visit_date", "offer_date")),
       variable.name = "level",
       value.name = c("level_id", "level_date"))[!is.na(level_id)]
rbind(
  dt.melted.na,
  incomplete_rows
)

Here, all rows are picked from dt.master which are partially incomplete, reshaped to long format and filtered afterwards. If this involves only a small fraction of rows of dt.master this might work also with limited memory.

Uwe
  • 41,420
  • 11
  • 90
  • 134
  • Thanks!! I totally agree that having `na.rm = c(TRUE, FALSE)` would be the most elegant solution (I hope that gets implemented in future updates). In the meantime, I find second option to be a good approach. – illanxr May 10 '18 at 07:32