1

I have a dataframe with more than 1 million rows and multiple columns.

I need to do a group by on the first three columns (Month, Prod, Rate) and then on a fourth new column I need to summarize following operation: add a value in the current row + add a value in the previous month (which can be the previous one, but I can not confirm this because of the multiple group by that I need to do.

I've tried with this:

Bdtest<- BDdata %>%
      group_by(Month, PROD, Currency) %>%
      arrange(Month) %>%
      summarise(Value = Amount, lag(Amount2, default = 0))

But I'm still not getting the result that I want to.

At the end I would like a result like this:

 Month    Prod    Currency   Value   
  1        Fix       USD       value = amount of this period + amount from previous period
  1        Var        USD       x2

Maybe a good solution could be make a join with two separate tables, and on that way I could join making the sum, but can I make in the same movement a Group By?

EDIT:

glimpse(bd5)
Observations: 1,843,645
Variables: 65

  $ Rate                 <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", ...
  $ Currency                    <chr> "998", "998", "998", "998", "998", "998", "998", "998", "998", "998", ...
  $ Flow_Type                <chr> "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", ...
  $ BANDA_TEMP                <chr> "01", "02", "02", "03", "03", "03", "04", "04", "05", "05", "05", "01"...
  $ PeriodoLow                <fct> 1M, 1M, 2M, 3M, 3M, 3M, 6M, 6M, 9M, 9M, 9M, 1M, 1M, 2M, 3M, 3M, 3M, 6M...
  $ PeriodoHigh               <fct> 2M, 2M, 3M, 6M, 6M, 6M, 9M, 9M, 1Y, 1Y, 1Y, 2M, 2M, 3M, 6M, 6M, 6M, 9M...
  $ DaysProp_CapHigh        <dbl> 329751.000, 5514.083, 0.000, 229447.533, 115113.678, 3725.489, 224284....
  $ DaysProp_CapLow        <dbl> 0.000, 325330.917, 331938.000, 103621.467, 219087.322, 331568.511, 112...

As you asked, here I put more data about my problem. So I have this huge Df where I need to create a final grouping by first fourth variables + Period.

And then finally making a sum to obtain Cap which will be: the current value of "DaysProp_CapLow" + "DaysProp_CapHigh" from previous month (or year).

On first month will be just: DaysProp_CapLow

But from second month will be: DaysProp_CapLow + DaysProp_CapHigh (from first month)

I have been trying with lag, but I have problems combining it with grouping by and summarize

Any help will be apreciated. Am trying to do this now joining tables.

  • 1
    If your columns are arranged could you try using the lag() command in your call? – elmuertefurioso Dec 02 '18 at 18:50
  • please provide some sample input – Vivek Kalyanarangan Dec 02 '18 at 19:14
  • More input was added, thanks. – Pablo Tapia Varela Dec 02 '18 at 19:46
  • @elmuertefurioso I don't know actually how to do the "group by" adding these values and summarizing. – Pablo Tapia Varela Dec 02 '18 at 19:47
  • @PabloTapiaVarela: Could you make your problem reproducible by sharing a sample of your data so others can help (please do not use `str()`, `head()` or screenshot)? You can use the [`reprex`](https://reprex.tidyverse.org/articles/articles/magic-reprex.html) and [`datapasta`](https://cran.r-project.org/web/packages/datapasta/vignettes/how-to-datapasta.html) packages to assist you with that. See also [Help me Help you](https://speakerdeck.com/jennybc/reprex-help-me-help-you?slide=5) & [How to make a great R reproducible example?](https://stackoverflow.com/q/5963269) – Tung Dec 02 '18 at 22:32
  • First: summarise creates ONE row for each group.. you should use mutate instead. Second: Use arrange before group and then mutate(value = cumsum(Amount)) – Ika8 Dec 03 '18 at 08:27

1 Answers1

2

Without a reprex I can't be sure, but based on your description I believe this should work:

Bdtest <- BDdata %>%
  group_by(Month, PROD, Currency) %>%
  arrange(Month) %>%
  mutate(Value = Amount + lag(Amount2, default = 0))
dmca
  • 675
  • 1
  • 8
  • 18