How to calculate a formula that takes different columns of a dataframe with the same suffix in the name and create a new column?

Question

I have a dataframe in R that contains the following columns structure (in a bigger scale):

  Material_code  actual_202009  actual_202010  actual_202011  pred_202009  pred_202010  pred_202011  
      111              30              44              24            25           52           27
      112              19              70              93            23           68           100

I would like to add new columns to the dataframe containing the respective error measure:

|actual - pred|/ actual * 100%

Obtaining this:

Material_code  actual_202009  actual_202010  actual_202011  pred_202009  pred_202010  pred_202011 MAPE_202009 MAPE_202010 MAPE_202011
      111              30              44              24            25           52          27     16.67%      18.18%       12.5%
      112              19              70              93            23           68          100    21.05%       2.86%        7.52%

I tried to create the new columns using ends_with() to select the previuous, but I am not getting it right. Can you please help?

*** EDIT to include easier way to generate the dataframe

df <- data.frame(Material_code = c(111,112),
                    actual_202009 = c(30,19),
                    actual_202010 = c(44,70),
                    actual_202011 = c(24,93), 
                    pred_202009 = c(25,23),
                    pred_202010 = c(52,68),
                    pred_202011 = c(27,100))

please share `dput(toyData)` so it's easier to copy and work with it. — PavoDive, Jan 28 '21 at 10:45
@PavoDive is this what you mean, for example? ```dummy <- data.frame(Material_code = c(111,112,113),actual_202009 = c(30,19,45), actual_202010 = c(44,70,67), actual_202011 = c(24,93,46), pred_202009 = c(25,23,50), pred_202010 = c(52,68,NA), pred_202011 = c(27,100,40))``` — jessirocha, Jan 28 '21 at 13:28

score 1 · Answer 1 · answered Jan 28 '21 at 10:49

1

A bit more verbose from the tidyverse:

library(tidyverse)
df %>%
  pivot_longer(cols = -Material_code) %>%
  separate(name, into = c("type", "time"), sep = "_") %>%
  pivot_wider(names_from = type) %>%
  mutate(MAPE = abs(actual - pred)/actual*100) %>%
  pivot_wider(values_from = c(actual, pred, MAPE),
              names_from = time)

gives:

# A tibble: 2 x 10
  Material_code actual_202009 actual_202010 actual_202011 pred_202009 pred_202010 pred_202011 MAPE_202009 MAPE_202010 MAPE_202011
          <int>         <int>         <int>         <int>       <int>       <int>       <int>       <dbl>       <dbl>       <dbl>
1           111            30            44            24          25          52          27        16.7       18.2        12.5 
2           112            19            70            93          23          68         100        21.1        2.86        7.53

answered Jan 28 '21 at 10:49

deschen

10,012
3
27
50

Thank you for the help. It seems like a sophisticated solution. However, I obtain a different output when using your code. ´´´A tibble: 6 x 11 Material_code sAPE actual_202009 actual_202010 actual_202011 pred_202009 pred_202010 pred_202011 MAPE_202009 1 111 0.0909 30 NA NA 25 NA NA 16.7 2 111 0.0833 NA 44 NA NA 52 NA NA ´´´ – jessirocha Jan 28 '21 at 14:38
If you get a different result, then something does not align in your code, e.g. the data you posted above is not the one you were using for testing my code. I just double-checked with your df data and I get exactly the results you want (which is also identical to what @Ronak Shah posted). – deschen Jan 28 '21 at 14:45
you are right, I don't know what I changed, but it works now. Thanks a lot! – jessirocha Jan 29 '21 at 05:51

score 1 · Answer 2 · answered Jan 28 '21 at 12:23

You will help yourself a lot if you try to keep your data in long format: each column has the same kind of data. Your table is in wide format, very useful for excel and human visualization, but very cumbersome to deal with in code.

So the first thing you need to do (that's what @deschen did in their answer) is converting your data to long, and then operate on it. A long version of your data will be of the form

Material_code    Type    Date   Value
          111  actual  202011      30

I will provide a data.table solution, that is basically the same as @deschen's. You may like this one for its speed on large data.

library(data.table)

setDT(df1)

df1[, melt(.SD, 1)][, 
               c("type", "date") := tstrsplit(variable, "_", fixed = TRUE)][,
                     dcast(.SD, Material_code + date ~ type)][, 
                         mape := 100 * abs(actual - pred) / actual][]

melt(.SD, 1) converts your table from wide to long, keeping only the first column as reference for each record.
c("type", "date") := tstrsplit(variable, "_", fixed = TRUE) creates columns type and date with the corresponding values taken from variable (after melting, variable has the former column names).
dcast(.SD, Material_code + date ~ type) converts the long table into wide again. This time, Material_code and date will be kept in columns, and type will be casted into new columns actual and pred.
The := is an assignment operator. It creates variable mape and assigns the resulting value.
The last bit, [] isn't actually needed. Is there so the result is printed to screen. If you don't need to print the new table to screen, omit it.

Thank you very much! Your code runs very fast. For me it is was extremely helpful the explanation you provided. I really appreciate it. — jessirocha, Jan 29 '21 at 06:28

score 0 · Accepted Answer · answered Jan 28 '21 at 10:42

Get the column names of all 'actual' and 'pred' columns and you can perform all the mathematical calculations on them directly.

actual_cols <- sort(grep('actual', names(df), value = TRUE))
pred_cols <- sort(grep('pred', names(df), value = TRUE))
new_cols <- sub('pred', 'MAPE', pred_cols)

df[new_cols] <- abs(df[actual_cols] - df[pred_cols])/df[actual_cols] * 100
df

#  Material_code actual_202009 actual_202010 actual_202011 pred_202009
#1           111            30            44            24          25
#2           112            19            70            93          23

#  pred_202010 pred_202011 MAPE_202009 MAPE_202010 MAPE_202011
#1          52          27        16.7       18.18       12.50
#2          68         100        21.1        2.86        7.53

data

df <- structure(list(Material_code = 111:112, actual_202009 = c(30L, 
19L), actual_202010 = c(44L, 70L), actual_202011 = c(24L, 93L
), pred_202009 = c(25L, 23L), pred_202010 = c(52L, 68L), pred_202011 = c(27L, 
100L)), class = "data.frame", row.names = c(NA, -2L))

How to calculate a formula that takes different columns of a dataframe with the same suffix in the name and create a new column?

3 Answers3

Linked