
I have a data table of 22 million rows of the following form:

`dt <- data.table(
  firm_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  metric = c("AN_BILANT", "OPEX", "CAPEX","AN_BILANT","OPEX", "CAPEX", "AN_BILANT", "OPEX", "CAPEX", "AN_BILANT","OPEX", "CAPEX"),
  value = c(2013, 10, 3,2014, 11, 5, 2007, 25, 10, 2009, 23, 7)
)`

I would like to generate the following output using data.table

`output_dt <- data.table(
  firm_id = c(1, 1, 1, 1, 2, 2, 2, 2),
  metric = c("OPEX", "CAPEX", "OPEX", "CAPEX", "OPEX", "CAPEX", "OPEX", "CAPEX"),
  AN_BILANT = c(2013, 2013, 2014, 2014, 2007, 2007, 2009, 2009),
  value = c(10, 3, 11, 5, 25, 10, 23, 7)
)`

I initially tried the following:

`dcast(dt[metric == "AN_BILANT"], firm_id ~ metric, value.var = "value", fun.aggregate = function(x) x)`

but I get the following error:

Error: Aggregating function(s) should take vector inputs and return a single value (length=1). However, function(s) returns length!=1. This value will have to be used to fill any missing combinations, and therefore must be length=1. Either override by setting the 'fill' argument explicitly or modify your function to handle this case appropriately.

I also tried

`dcast.data.table(dt[, N := 1:.N, metric], firm_id ~ metric, subset = (metric == "AN_BILANT"))`

Here I get the warning

Aggregate function missing, defaulting to 'length'

2 Answers


I like akrun's approach, but if the data forces you to choose an alternative (in case `cumsum` is too sensitive to the ordering of the data), you can try a dcast/melt approach like below. Note that since each `firm_id` does not contain just one of each metric, we need to add another variable temporarily so that we don't over-reduce during the initial dcast.

library(data.table)
dcast(dt[, grp := seq_len(.N), by = .(firm_id, metric)],
      firm_id + grp ~ metric, value.var = "value")[, grp := NULL] |>
  melt(c("firm_id", "AN_BILANT"), variable.name = "metric")
#    firm_id AN_BILANT metric value
#      <num>     <num> <fctr> <num>
# 1:       1      2013  CAPEX     3
# 2:       1      2014  CAPEX     5
# 3:       2      2007  CAPEX    10
# 4:       2      2009  CAPEX     7
# 5:       1      2013   OPEX    10
# 6:       1      2014   OPEX    11
# 7:       2      2007   OPEX    25
# 8:       2      2009   OPEX    23

Granted, the order of the rows is not the same, but double-reshaping typically does not guarantee that.
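If the original firm/year ordering matters, a `setorder` call afterwards can recover it. A sketch using the sample `dt` from the question (note that within each year, OPEX/CAPEX rows follow melt's column order, not necessarily the original row order):

```r
library(data.table)

dt <- data.table(
  firm_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  metric = c("AN_BILANT", "OPEX", "CAPEX", "AN_BILANT", "OPEX", "CAPEX",
             "AN_BILANT", "OPEX", "CAPEX", "AN_BILANT", "OPEX", "CAPEX"),
  value = c(2013, 10, 3, 2014, 11, 5, 2007, 25, 10, 2009, 23, 7)
)

res <- dcast(dt[, grp := seq_len(.N), by = .(firm_id, metric)],
             firm_id + grp ~ metric, value.var = "value")[, grp := NULL] |>
  melt(c("firm_id", "AN_BILANT"), variable.name = "metric")

# sort back into firm/year order by reference
setorder(res, firm_id, AN_BILANT)
```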

r2evans
  • Just curious, why not use data.table's own `[` chaining instead of the native pipe? i.e. `, grp := NULL ][, melt(.SD, c("firm_id", "AN_BILANT"), variable.name = "metric")]`? – Onyambu Feb 27 '23 at 03:57
  • Notice that both `][` are on the next line. Technically it does not matter whether you place them at the end of the previous line or at the start of the next line, as long as they are side by side. – Onyambu Feb 27 '23 at 04:00
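For reference, the fully chained form Onyambu is describing would look like the sketch below (same sample `dt` as in the question, melting inside `j` via `.SD` instead of piping):

```r
library(data.table)

dt <- data.table(
  firm_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  metric = c("AN_BILANT", "OPEX", "CAPEX", "AN_BILANT", "OPEX", "CAPEX",
             "AN_BILANT", "OPEX", "CAPEX", "AN_BILANT", "OPEX", "CAPEX"),
  value = c(2013, 10, 3, 2014, 11, 5, 2007, 25, 10, 2009, 23, 7)
)

# each `[` chains onto the data.table returned by the previous step
res <- dcast(dt[, grp := seq_len(.N), by = .(firm_id, metric)],
             firm_id + grp ~ metric, value.var = "value")[
  , grp := NULL][
  , melt(.SD, c("firm_id", "AN_BILANT"), variable.name = "metric")]
```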
  • 1
    yeah, that can work too; sometimes I like to break it out since it does a reasonable job of separating each step visually @onyambu – r2evans Feb 27 '23 at 05:23
  • This has been helpful as well, and I can see it being useful in situations without order within the data! – newmathlearner_7 Feb 28 '23 at 22:58

We may create the 'AN_BILANT' column by assigning (`:=`) the first 'value' within each group formed by the cumulative sum of a logical vector, and then remove the 'AN_BILANT' rows

library(data.table)
dt[, AN_BILANT :=  value[1], cumsum(metric == 'AN_BILANT')][
    metric != 'AN_BILANT']

-output

    firm_id metric value AN_BILANT
1:       1   OPEX    10      2013
2:       1  CAPEX     3      2013
3:       1   OPEX    11      2014
4:       1  CAPEX     5      2014
5:       2   OPEX    25      2007
6:       2  CAPEX    10      2007
7:       2   OPEX    23      2009
8:       2  CAPEX     7      2009
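If the column order of the requested `output_dt` matters (`AN_BILANT` before `value`), a `setcolorder` afterwards rearranges the columns by reference. A sketch with the sample `dt`:

```r
library(data.table)

dt <- data.table(
  firm_id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
  metric = c("AN_BILANT", "OPEX", "CAPEX", "AN_BILANT", "OPEX", "CAPEX",
             "AN_BILANT", "OPEX", "CAPEX", "AN_BILANT", "OPEX", "CAPEX"),
  value = c(2013, 10, 3, 2014, 11, 5, 2007, 25, 10, 2009, 23, 7)
)

# each run of rows starting at an AN_BILANT row forms one group
res <- dt[, AN_BILANT := value[1], cumsum(metric == "AN_BILANT")][
    metric != "AN_BILANT"]

# reorder columns by reference to match the requested output
setcolorder(res, c("firm_id", "metric", "AN_BILANT", "value"))
```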
akrun