Aggregating by a substring of column

Question

Consider the following dataframe in R:

  TYPE    VARIAVEL                           VALOR               

  A       OPER_RELAC_VARIAVEL1                100
  A       OPER_RELAC_VARIAVEL2                200
  A       OPER_RELAC_VARIAVEL3                300
  B       OPER_RELAC_VARIAVEL1                100
  B       OPER_RELAC_VARIAVEL2                200
  B       OPER_RELAC_VARIAVEL3                300
  A       CLI_RELAC_VARIAVEL1                 450
  A       CLI_RELAC_VARIAVEL2                 320
  A       CLI_RELAC_VARIAVEL3                 110

I want to take the relevance of each VALOR based on the root of VARIAVEL and TYPE. I don´t have a column with the root of the VARIAVEL, but it would be everyting before the second _ (OPER_RELAC and CLI_RELAC in this sample).

The expected result is:

  TYPE    VARIAVEL                           VALOR           RELEVANCE    

  A       OPER_RELAC_VARIAVEL1                100            0.167
  A       OPER_RELAC_VARIAVEL2                200            0.333
  A       OPER_RELAC_VARIAVEL3                300            0.500
  B       OPER_RELAC_VARIAVEL1                100            0.167
  B       OPER_RELAC_VARIAVEL2                200            0.333
  B       OPER_RELAC_VARIAVEL3                300            0.500
  A       CLI_RELAC_VARIAVEL1                 450            0.511
  A       CLI_RELAC_VARIAVEL2                 320            0.364
  A       CLI_RELAC_VARIAVEL3                 110            0.125

Since, for example, 450 represents 51.1% of the total for type A and root variable CLI_RELAC.

I have done with this sequence of commands:

1) Generate a column with the root variable using the library stringr

dados$VARIAVEL_MAE <- str_match(dados$VARIAVEL, "^([^_]+[_][^_]+)")[,2]

Thanks to R:how to get grep to return the match, rather than the whole string

2) Summarize in a new dataframe the totals aggregating by this fresh new column

TOTAIS <- aggregate(VALOR ~ Type + VARIAVEL_MAE, data = dados, sum)
names(TOTAIS) <- c('Type', 'VARIAVEL_MAE', 'TOTAL')

3) Merge this two dataframes using what is suggested here

dados <- merge(TOTAIS, dados, by = c('Type', 'VARIAVEL_MAE'))
dados$RELEVANCIA <- dados$VALOR / dados$TOTAL;

Is there a smarter way to do it or I do need all these steps?

My question is because in R, everything I do can always be replaced by something faster and smaller.

Frank · Accepted Answer · 2018-02-22T17:22:55.087

In the OP's example, we can split on _ instead of using a regex:

library(data.table)
setDT(DT)

DT[, paste0("vnome", 1:3) := tstrsplit(VARIAVEL, "_")]
DT[, z := VALOR/sum(VALOR), by=.(TYPE, vnome1, vnome2)]

   TYPE             VARIAVEL VALOR vnome1 vnome2    vnome3         z
1:    A OPER_RELAC_VARIAVEL1   100   OPER  RELAC VARIAVEL1 0.1666667
2:    A OPER_RELAC_VARIAVEL2   200   OPER  RELAC VARIAVEL2 0.3333333
3:    A OPER_RELAC_VARIAVEL3   300   OPER  RELAC VARIAVEL3 0.5000000
4:    B OPER_RELAC_VARIAVEL1   100   OPER  RELAC VARIAVEL1 0.1666667
5:    B OPER_RELAC_VARIAVEL2   200   OPER  RELAC VARIAVEL2 0.3333333
6:    B OPER_RELAC_VARIAVEL3   300   OPER  RELAC VARIAVEL3 0.5000000
7:    A  CLI_RELAC_VARIAVEL1   450    CLI  RELAC VARIAVEL1 0.5113636
8:    A  CLI_RELAC_VARIAVEL2   320    CLI  RELAC VARIAVEL2 0.3636364
9:    A  CLI_RELAC_VARIAVEL3   110    CLI  RELAC VARIAVEL3 0.1250000

In a more general case, mentioned by the OP as "everything before the second _", we can use @akrun's approach from another Q&A (assuming VARIAVEL contains no spaces):

DT[, c("vroot", "vseq") := 
  tstrsplit(sub('(^[^_]+_[^_]+)_(.*)$', '\\1 \\2', VARIAVEL), ' ')]
DT[, z := VALOR/sum(VALOR), by=.(TYPE, vroot)]

   TYPE             VARIAVEL VALOR      vroot      vseq         z
1:    A OPER_RELAC_VARIAVEL1   100 OPER_RELAC VARIAVEL1 0.1666667
2:    A OPER_RELAC_VARIAVEL2   200 OPER_RELAC VARIAVEL2 0.3333333
3:    A OPER_RELAC_VARIAVEL3   300 OPER_RELAC VARIAVEL3 0.5000000
4:    B OPER_RELAC_VARIAVEL1   100 OPER_RELAC VARIAVEL1 0.1666667
5:    B OPER_RELAC_VARIAVEL2   200 OPER_RELAC VARIAVEL2 0.3333333
6:    B OPER_RELAC_VARIAVEL3   300 OPER_RELAC VARIAVEL3 0.5000000
7:    A  CLI_RELAC_VARIAVEL1   450  CLI_RELAC VARIAVEL1 0.5113636
8:    A  CLI_RELAC_VARIAVEL2   320  CLI_RELAC VARIAVEL2 0.3636364
9:    A  CLI_RELAC_VARIAVEL3   110  CLI_RELAC VARIAVEL3 0.1250000

Data:

DT = structure(list(TYPE = c("A", "A", "A", "B", "B", "B", "A", "A", 
"A"), VARIAVEL = c("OPER_RELAC_VARIAVEL1", "OPER_RELAC_VARIAVEL2", 
"OPER_RELAC_VARIAVEL3", "OPER_RELAC_VARIAVEL1", "OPER_RELAC_VARIAVEL2", 
"OPER_RELAC_VARIAVEL3", "CLI_RELAC_VARIAVEL1", "CLI_RELAC_VARIAVEL2", 
"CLI_RELAC_VARIAVEL3"), VALOR = c(100L, 200L, 300L, 100L, 200L, 
300L, 450L, 320L, 110L)), .Names = c("TYPE", "VARIAVEL", "VALOR"
), row.names = c(NA, -9L), class = "data.frame")

Plus 1, but this doesn't seem particularly complicated for regex though. Couldn't we just do something like `by=.(TYPE, sub("(.*_.*)_.*", "\\1", VARIAVEL))` or something similar? — A5C1D2H2I1M1N2O1R2T1, Feb 22 '18 at 17:01
ALL the times I get surprised how things in R can be simpler than I think... Thanks!!!! But something we have to considere is that the variable could by `CLI_RELAC_PART1_PART2` — Nizam, Feb 22 '18 at 17:08
@a5c Yeah, a regex would also be more robust against the case in Nizam's comment. I will add yours or something like it, just a sec. Side note: I like to do regex at most once and very early in data processing, so I'd make a new col rather than doing it on-the-fly in `by=`. — Frank, Feb 22 '18 at 17:12

Aggregating by a substring of column

1 Answers1