Consider the following dataframe in R:
TYPE VARIAVEL VALOR
A OPER_RELAC_VARIAVEL1 100
A OPER_RELAC_VARIAVEL2 200
A OPER_RELAC_VARIAVEL3 300
B OPER_RELAC_VARIAVEL1 100
B OPER_RELAC_VARIAVEL2 200
B OPER_RELAC_VARIAVEL3 300
A CLI_RELAC_VARIAVEL1 450
A CLI_RELAC_VARIAVEL2 320
A CLI_RELAC_VARIAVEL3 110
I want to take the relevance of each VALOR
based on the root of VARIAVEL
and TYPE
. I don´t have a column with the root of the VARIAVEL
, but it would be everyting before the second _
(OPER_RELAC
and CLI_RELAC
in this sample).
The expected result is:
TYPE VARIAVEL VALOR RELEVANCE
A OPER_RELAC_VARIAVEL1 100 0.167
A OPER_RELAC_VARIAVEL2 200 0.333
A OPER_RELAC_VARIAVEL3 300 0.500
B OPER_RELAC_VARIAVEL1 100 0.167
B OPER_RELAC_VARIAVEL2 200 0.333
B OPER_RELAC_VARIAVEL3 300 0.500
A CLI_RELAC_VARIAVEL1 450 0.511
A CLI_RELAC_VARIAVEL2 320 0.364
A CLI_RELAC_VARIAVEL3 110 0.125
Since, for example, 450 represents 51.1% of the total for type A
and root variable CLI_RELAC
.
I have done with this sequence of commands:
1) Generate a column with the root variable using the library stringr
dados$VARIAVEL_MAE <- str_match(dados$VARIAVEL, "^([^_]+[_][^_]+)")[,2]
Thanks to R:how to get grep to return the match, rather than the whole string
2) Summarize in a new dataframe the totals aggregating by this fresh new column
TOTAIS <- aggregate(VALOR ~ Type + VARIAVEL_MAE, data = dados, sum)
names(TOTAIS) <- c('Type', 'VARIAVEL_MAE', 'TOTAL')
3) Merge this two dataframes using what is suggested here
dados <- merge(TOTAIS, dados, by = c('Type', 'VARIAVEL_MAE'))
dados$RELEVANCIA <- dados$VALOR / dados$TOTAL;
Is there a smarter way to do it or I do need all these steps?
My question is because in R, everything I do can always be replaced by something faster and smaller.