1

I need to collapse partially duplicative rows in a data.table

Example:

  df.in <- data.table (
  tkr = c("abc", "abc", "def", "def", "ghi", "ghi"),
  lboq = c(0,296, -390, 0,  -88, 0),
  locq = c(-296,0, 0, 390, 0, 88),
  ap = c(134,134, 23, 23, 17, 17)
  )

     tkr lboq locq  ap
  1: abc    0 -296 134
  2: abc  296    0 134
  3: def -390    0  23
  4: def    0  390  23
  5: ghi  -88    0  17
  6: ghi    0   88  17

Using this example, I want to end up with three rows where the second row collapses into the first with 296 in column 2; row 4's 390 would be in column 3 of row 3, etc.

So the desired result would be:

  df.out <- data.table (
  tkr = c("abc", "def", "ghi"), 
  lboq = c(296, -390, -88),
  lbcq = c(-296, 390, 88),
  ap = c(134, 23, 17)
  )

     tkr lboq lbcq  ap
  1: abc  296 -296 134
  2: def -390  390  23
  3: ghi  -88   88  17

I have studied the following with the reaction indicated:

collapse rows in 2 different columns in data.table?

collapse rows in data.table

Despite the titles in the above two, they looks to be more of an expansion than collapsing

I also explored using coalesce in the following, but that appears to be for columns only; is there a coalesce for rows?

Coalesce two string columns with alternating missing values to one

Use dplyr coalesce in programming

How to use Coalesce function on a dataframe

I then looked at

R Summarize Collapsed Data.Table

but the collapsing already done here and needs summarizing; I need to do the collapsing

this looks to be exactly what I need

Merging complementary rows of a dataframe with R

but adapting it to my use doesn't seem to work:

    df.in %>%
      group_by(tkr, ap) %>%
      summarise_all(funs(sort(.)[1]))

What am I missing?

W Barker
  • 324
  • 2
  • 8

1 Answers1

4

Maybe this will help :

library(dplyr)

df.in %>%
  group_by(tkr) %>%
  summarise(across(lboq:ap, ~.x[.x != 0][1]))

#  tkr    lboq  locq    ap
#* <chr> <dbl> <dbl> <dbl>
#1 abc     296  -296   134
#2 def    -390   390    23
#3 ghi     -88    88    17

For each tkr this selects the 1st non-zero value in columns lboq:ap.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thanks! More coffee coming! (It is obvious to me that I haven't yet been able to fully internalize the use of the tilde and dots). I've been able to successfully adapt your solution to my much larger dataset, but the bracketed one "[1]" prompts two questions: 1) it appears that that is part of the `across` syntax, but I see no explanation of it in `?across`; 2) my larger dataset has four pairs of what I used in the example ("lboq" and "locq") and your code works fine. I'm not complaining (!), but it makes me wonder what you meant by your comment "1st non-zero value". – W Barker Mar 13 '21 at 11:36
  • For the example shared there are two cases possible. 1) When there is one zero value and another non-zero value.(`Eg - lboq` for `abc`) 2) When the value is the same (Eg - `ap` for `abc`). In your actual data do you have another case where you have values like 10, 12 in one column for one group, which value will you select in that case? If the answer is no, there cannot be such case in real data then no need to worry the answer would work as expected but if it is possible that such case exist in which case `[1]` will ensure that 10 gets selected and not 12 as we have to select only one value. – Ronak Shah Mar 13 '21 at 11:51
  • @RonakShah how can I make this work for `char` type columns? My example is similar to OP question but `lboq:ap` ara characters with `NA`s where there's no value – blazej Aug 30 '21 at 14:25
  • You can use `~na.omit(.x)[1]`. – Ronak Shah Aug 31 '21 at 01:09