1

I have data in the following format (excerpt):

# install.packages("data.table")
# install.packages("tidyverse")

library("data.table")
library("tidyverse")

dt <- data.table(
 date = lubridate::as_date(c("2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04")),
 location = c("Westpark", "Northpark", "Estpark", "Southpark"),
 'receiver_a: max' = c(20, 30, 25, 15),
 'receiver_a: min' = c(10, 15, 20, 5),
 'receiver_b: max' = c(15, 45, 10, 50),
 'receiver_b: min' = c(15, 45, 10, 50)
)
> dt
        date  location receiver_a: max receiver_a: min receiver_b: max receiver_b: min
1: 2021-01-01  Westpark              20              10              15              15
2: 2021-01-02 Northpark              30              15              45              45
3: 2021-01-03   Estpark              25              20              10              10
4: 2021-01-04 Southpark              15               5              50              50

I now want to split all columns starting with receiver_... in the column name and convert them to a long format:

  • Split all columns with receiver_... in the column name into new columns receiver, max and min. Everything after the : in the "old" column names starting with receiver... is used for the new column names, here max and min.
  • The new column receiver contains e.g. receiver_a, receiver_b etc as values. The new columns max and min contain the respective numerical values.

This can be implemented with tidyr::pivot_longer():

# dt <-  dt %>% 
dt %>%
  tidyr::pivot_longer(
    cols         = dplyr::contains(":"),
    names_to     = c("receiver", ".value"),
    names_sep    = ": ", 
    names_repair = "minimal"
  )
# A tibble: 8 x 5
  date       location  receiver     max   min
  <date>     <chr>     <chr>      <dbl> <dbl>
1 2021-01-01 Westpark  receiver_a    20    10
2 2021-01-01 Westpark  receiver_b    15    15
3 2021-01-02 Northpark receiver_a    30    15
4 2021-01-02 Northpark receiver_b    45    45
5 2021-01-03 Estpark   receiver_a    25    20
6 2021-01-03 Estpark   receiver_b    10    10
7 2021-01-04 Southpark receiver_a    15     5
8 2021-01-04 Southpark receiver_b    50    50
  • cols = dplyr::contains(":"): select all columns with : inside the name, for example receiver_a: max
  • names_to = c("receiver", ".value"): Split selected columns into receiver and .value. .value indicates that component of the name defines the name of the column containing the cell values
  • names_sep = ": " If names_to contains multiple values, these arguments control how the column name is broken up, here by : (whitespace after colon)

My question: Can this also be done with a (faster) data.table solution (e.g. with melt())?

Mark Black
  • 39
  • 5
  • In the `data.table` [development version (`1.14.1`)](https://github.com/Rdatatable/data.table/wiki/Installation#v1141-in-development--) `melt` has a new `measure` argument. Thus, you could do `melt(dt, measure.vars = measure(rec, value.name, sep = ": "))`. I have tried to explain how it works in [another post](https://stackoverflow.com/a/41884029/1851712). That post originally dealt with a `pre-1.14.1` workaround for a set-up similar to yours (as used in the answer below). – Henrik Jul 07 '21 at 20:31
  • @Henrik: Excellent hint regarding the further development of `data.table` with the new argument `measure` from `melt`: this will make this procedure easier and more streamlined in the future. – Mark Black Jul 08 '21 at 05:25
  • Indeed. But you don't have to wait to the future. You can install the development version (if not blocked by your department) and try it already today ;) Good luck! – Henrik Jul 08 '21 at 09:07
  • 1
    Already executed and successfully tested! – Mark Black Jul 08 '21 at 12:27

1 Answers1

1

We can use measure with patterns in data.table

library(data.table)
nm1 <- unique(sub(":.*", "", names(dt)[-(1:2)]))
melt(dt, measure = patterns("max", "min"),
    value.name = c("max", "min"), variable.name = "receiver")[, 
     receiver := nm1[receiver]][]

-output

         date  location   receiver max min
1: 2021-01-01  Westpark receiver_a  20  10
2: 2021-01-02 Northpark receiver_a  30  15
3: 2021-01-03   Estpark receiver_a  25  20
4: 2021-01-04 Southpark receiver_a  15   5
5: 2021-01-01  Westpark receiver_b  15  15
6: 2021-01-02 Northpark receiver_b  45  45
7: 2021-01-03   Estpark receiver_b  10  10
8: 2021-01-04 Southpark receiver_b  50  50
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Another question: With your solution, do we need to know the exact name of the new column names in `value.name = c("max", "min")`? With `pivot_longer()` or the new `measure` argument of `melt` (dep. version 1.14.1, see @Henrik) the new column names (except `receiver`) do not need to be specified. – Mark Black Jul 08 '21 at 05:15
  • @MarkBlack Yes. But, if you are okay to use dev version, then you may have that advantage. Often times, dev version would have some bugs – akrun Jul 08 '21 at 16:15
  • I'll use your solution either way for now, since I don't like to work with dev versions in the "productive" environment. However, I will implement @Henrik's solution as soon as the official version is released. – Mark Black Jul 08 '21 at 20:10
  • @MarkBlack I would suggest to create a New Project in Rstudio with renv or so and then install the dev version and test it instead of installing on the whole system. In that way, you can test for different versions. Or may spin a docker container and install the dev version on it. – akrun Jul 08 '21 at 20:12