0

I have a data frame that I need to group by a combination of columns entries in order to conditionally mutate several columns using only an if statement. I tried following the code posted here Using dplyr to group_by and conditionally mutate only with if (without else) statement. However, my problem is slightly different.

I want to sum data (D1) by county between the months of October of t-1 year and September of t year. Month variable = Mth. As you can see below, the sum of D1 from 10/2004 to 09/2005 is 300. My dataset is more than 100K rows for the entire US for a period of 15 years. For the purpose of demonstration, I have shown just two years for just one county. In my dataset there will be more counties, I have to address that in the code.

Year    County  St  D1  Mth Output
2005    Multon  OR  20  10  NA
2005    Multon  OR  0   10  NA
2005    Multon  OR  10  9   300
2005    Multon  OR  10  9   NA
2005    Multon  OR  20  8   NA
2005    Multon  OR  20  8   NA
2005    Multon  OR  30  7   NA
2005    Multon  OR  10  7   NA
2005    Multon  OR  0   6   NA
2005    Multon  OR  20  6   NA
2005    Multon  OR  10  5   NA
2005    Multon  OR  30  5   NA
2005    Multon  OR  0   4   NA
2005    Multon  OR  30  4   NA
2005    Multon  OR  20  3   NA
2005    Multon  OR  10  3   NA
2005    Multon  OR  30  2   NA
2005    Multon  OR  10  2   NA
2005    Multon  OR  0   1   NA
2005    Multon  OR  10  1   NA
2004    Multon  OR  0   12  NA
2004    Multon  OR  20  12  NA
2004    Multon  OR  0   11  NA
2004    Multon  OR  10  10  NA
2004    Multon  OR  10  9   80
2004    Multon  OR  20  8   NA
2004    Multon  OR  20  8   NA
2004    Multon  OR  30  7   NA

Can someone please help me with this query? I am learning R through this project.

Thanks!

Ritika

  • 1
    Hi Rikita, Can you edit your question and re-copy your code for the dataset. It looks like the beginning is missing. – Bloxx Feb 09 '22 at 21:15
  • @Bloxx Thank you for reading my question. But I checked my dataset again. It was copied correctly. Could you please explain what part do you think is missing? – Ritika Khurana Feb 09 '22 at 22:46
  • I think that `.internal.selfref` element is strange. Remove it, and it works. When you paste it into a fresh R session, does it throw an error? – wibeasley Feb 09 '22 at 23:13
  • 1
    I'm glad you shortened the dataset (I'm imaging this is 10k+ rows), but it's poorly suited for your question because (a) it's all the same county, (b) it's all the same time period, (c) `D1` is always zero, and (d) it has a bunch of unnecessary columns. Only `MapDate`, `County`, `State`, and `D0` are required for this, correct? Also, please include an example of the expected dataset. This is more thoroughly described in [How to make a great R reproducible example?](https://stackoverflow.com/q/5963269/1082435) – wibeasley Feb 09 '22 at 23:20
  • 1
    @wibeasley, the `internal.selfref` is normal for `data.table` objects. It shouldn't be included in the question, it should be edited out (and optionally wrapped in `setDT(.)`, but it's normal nonetheless. (Its presence means that one cannot just copy that text and paste into R, it will complain about it.) I fixed the data, it is now usable, though it does require (as its class suggests) that `data.table` be loaded. – r2evans Feb 10 '22 at 00:22
  • 1
    @RitikaKhurana, your data is almost completely invariant, it would really help if you gave us representative data ***and the expected output*** for that sample data, as suggested already. This sounds like "summarize by group", which is a FAQ, with one such Q/A here: https://stackoverflow.com/q/11562656/3358272. Hope this helps. – r2evans Feb 10 '22 at 00:23
  • @r2evans Thank you for responding! I hope this helps now. Sorry, I couldn't figure out how to bring the appropriate dataset from R to present here. So, I made the dataset in excel to explain my query. – Ritika Khurana Feb 15 '22 at 08:13
  • The data is improved, thank you., but there are still questions/problems. You say *"by county"*, yet all of Multon's `D1` does not add up to 300. Further, Multon is missing most of the months in between; while "missing months" is not in itself a problem, the fact that Multon gets an output yet Albany (the only other county with a month or 9) does not. I'm inferring that the output only goes in month 9, all other months are supposed to remain `NA`. – r2evans Feb 15 '22 at 14:14
  • @r2evans I changed the data. This time I have only used one county to explain it better. In my dataset there are more counties, I couldn't present it here. Taking this as an example, (say for Multon, I need to sum all D1 for 10th of t-1 year to 9th of t year). I have shown that for 2004 as well but the sum (80) is only for the data presented here. For all other months, I can be missing or something number, I would collapse it later anyway. Just to remind, I want to add this as a column in a dataframe. That's why I am using 'mutate'. – Ritika Khurana Feb 16 '22 at 00:44

1 Answers1

1

As best as I can tell, this is what you want:

df %>% 
  mutate(
    grouper = Year + (Mth >= 10),
    grouper = paste("Oct", grouper - 1, "- Sept", grouper)
  ) %>%
  group_by(grouper, County) %>%
  mutate(Output = sum(D1))
# # A tibble: 28 × 7
# # Groups:   grouper, County [3]
#     Year County St       D1   Mth Output grouper             
#    <int> <chr>  <chr> <int> <int>  <int> <chr>               
#  1  2005 Multon OR       20    10     20 Oct 2005 - Sept 2006
#  2  2005 Multon OR        0    10     20 Oct 2005 - Sept 2006
#  3  2005 Multon OR       10     9    300 Oct 2004 - Sept 2005
#  4  2005 Multon OR       10     9    300 Oct 2004 - Sept 2005
#  5  2005 Multon OR       20     8    300 Oct 2004 - Sept 2005
#  6  2005 Multon OR       20     8    300 Oct 2004 - Sept 2005
#  7  2005 Multon OR       30     7    300 Oct 2004 - Sept 2005
#  8  2005 Multon OR       10     7    300 Oct 2004 - Sept 2005
#  9  2005 Multon OR        0     6    300 Oct 2004 - Sept 2005
# 10  2005 Multon OR       20     6    300 Oct 2004 - Sept 2005
# # … with 18 more rows

Using this sample data:

df = read.table(text = 'Year    County  St  D1  Mth Output
2005    Multon  OR  20  10  NA
2005    Multon  OR  0   10  NA
2005    Multon  OR  10  9   300
2005    Multon  OR  10  9   NA
2005    Multon  OR  20  8   NA
2005    Multon  OR  20  8   NA
2005    Multon  OR  30  7   NA
2005    Multon  OR  10  7   NA
2005    Multon  OR  0   6   NA
2005    Multon  OR  20  6   NA
2005    Multon  OR  10  5   NA
2005    Multon  OR  30  5   NA
2005    Multon  OR  0   4   NA
2005    Multon  OR  30  4   NA
2005    Multon  OR  20  3   NA
2005    Multon  OR  10  3   NA
2005    Multon  OR  30  2   NA
2005    Multon  OR  10  2   NA
2005    Multon  OR  0   1   NA
2005    Multon  OR  10  1   NA
2004    Multon  OR  0   12  NA
2004    Multon  OR  20  12  NA
2004    Multon  OR  0   11  NA
2004    Multon  OR  10  10  NA
2004    Multon  OR  10  9   80
2004    Multon  OR  20  8   NA
2004    Multon  OR  20  8   NA
2004    Multon  OR  30  7   NA', header = T)
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
  • Thank you for answering my query. When I run this code, my 'Output' column is Null. Just to give you a heads up. Output column in my dataset is the desired column that should be generated after running this block of commands. – Ritika Khurana Feb 17 '22 at 01:03
  • I don't know what to tell you. My answer is fully reproducible - if you restart R, load `dplyr`, read in the sample data you provided using the code I provided, and run the code, you should get exactly the results shown in my answer. Please verify that's the case, and then look for differences between the sample data and your actual data. Are you filling in your real data's name instead of `df`? Are you assigning the result, e.g., `df <- df %>% ...` to overwrite your existing data or `new_data <- df %>% ...` to assign it to a new object? Is anything else different? – Gregor Thomas Feb 17 '22 at 01:28
  • Confirmed that I can also recreate the desired output column with this answer – Nick Camarda Apr 16 '22 at 19:41