0

Let's say I have a data frame like this:

> df = data.frame(id = c(2,88,4), sale1 = c(100,200,50), sale2 = c(50,150,50), sale3 = c(60, 100, 75))
> df
  id sale1 sale2 sale3
1  2   100    50    60
2 88   200   150   100
3  4    50    50    75

and i want to find the percent difference between each two sales seasons for each item id. so basically:

[sale_n+1] - [sale_n])/[sale_n] for each row, where n = sales column index

and i want to do that across all the columns, and put the percent diff into a new table.

so the end result table will contain the "id" column plus N-1 columns, where N = number of columns (other than id column) in the original data frame.

for the example df i gave above, the desired output would be (all values other than id column are in percentage):

> df_diff
  id sale12 sale23
1  2    -50     20
2 88    -25    -33
3  4      0     50

I've read post like the following:

  1. How to calculate percentage change from different rows over different spans
  2. Calculate percentage change in an R data frame

but the solutions provided in 1 does not scale (since i have so many columns!), and 2 (which uses lag()) seems like it's doing difference by row.

any help (hints, pointer, redirect) will be appreciated!

end note

would also greatly appreciate if you could teach me a way to smartly name the new result table's columns accordingly; like the columns will be named (in this given example): "sale12" and "sale 23"

Community
  • 1
  • 1
alwaysaskingquestions
  • 1,595
  • 5
  • 22
  • 49

1 Answers1

4

You can use setNames in a call to lapply. Take the first sales column where you can calculate a percent, column 3, and loop through to the end of the sales columns. I used x-2 and x-1 to name the columns because they were appropriate here, but it will depend on the value of n:

d <- df["id"]

n <- 3L # column where your first percent can be calculated
d[,2:length((n-1):dim(df)[2])] <- 
    lapply(n:dim(df)[2], function(x) setNames(((df[x] - df[x-1]) / df[x-1]),
                                              paste0("sales", x-2, x-1)))
#  id sales12    sales23
#1  2   -0.50  0.2000000
#2 88   -0.25 -0.3333333
#3  4    0.00  0.5000000
Jota
  • 17,281
  • 7
  • 63
  • 93
  • does this also calculates the percent diff? i see that you put the formula inside setNames() function. why? – alwaysaskingquestions May 29 '16 at 23:51
  • You can multiply by 100 in the formula to get percent. The first argument to `setNames` is the thing you want to name, which is the result of the formula. – Jota May 29 '16 at 23:57
  • my question is why the formula is INSIDE the setNames() function? im not understanding why this works... shouldnt the formula be outside of the setNames()? and AFTER we calculate, THEN we'll set the names, right? however, i did test your code, it does work! I am just confused about the syntax. Thank you! – alwaysaskingquestions May 29 '16 at 23:58
  • You can do the naming afterwards if you want. You can definitely put `setNames` on the outside of the `lapply`, or use `names` in a later step. But from the `?setNames` page: "This is a convenience function that sets the names on an object and returns the object." I'm just naming the column that results from the calculation and returning it with a name. Hope that helps. – Jota May 30 '16 at 00:05
  • so basically, you do NOT have to have a defined x (aka input data) for setNames(); in fact, you can just create the x (input data) inside setNames(). Am I understanding this correctly? This is definitely a new syntax for me, but great to know that there's such shortcut option! – alwaysaskingquestions May 30 '16 at 00:08
  • Really interesting to see that someone thinks thinks this question was well-defined. – IRTFM May 30 '16 at 00:12