-1

This is a follow up question to the following problem give here

I have the following data

Data:

df = structure(list(Org_ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L), 
    Market_volume = c(100L, 200L, 300L, 50L, 500L, 400L, 200L, 
    300L, 100L), Indicator_variable = c(1L, 0L, 0L, 1L, 1L, 0L, 
    0L, 0L, 0L),variable3=c(10L, 1L, 1L, 4L, 2L, 3L, 3L, 10L, 3L),variable4=c(2L, 1L, 1L, 7L, 2L, 3L, 3L, 8L, 3L)).Names = c("Org_ID", "Market_volume", "Indicator_variable","Var3","Var4"
), class = "data.frame", row.names = c(NA, -9L))

Using (dplyr), i calculated the % of NA's by market volume by Org_ID via the following function

df %>%
  group_by(Org_ID) %>%
  summarize(sum_market_vol = sum(Market_volume*!Indicator_variable),
            tot_market_vol = sum(Market_volume)) %>%
  transmute(Org_ID, Perc_Market_Vol = 100*sum_market_vol/tot_market_vol)

Result:

# A tibble: 3 x 2
  Org_ID Perc_Market_Vol
   <int>           <dbl>
1      1        83.33333
2      2         0.00000
3      3       100.00000

Question: I want to subset my original data by deleting all rows of Org_ID (say 2) # X if perc_market_vol<30. That is i do not want to delete individual rows of the same org_id, but Org_id as a whole, say all counts of Org_id =1 or org_id = 2. How can i subset it linking two tables or functions?

I want the new data look like this:

df1 = structure(list(Org_ID = c(1L, 1L, 1L, 3L, 3L, 3L, 3L), 
    Market_volume = c(100L, 200L, 300L, 400L, 200L, 
    300L, 100L), Indicator_variable = c(1L, 0L, 0L, 0L, 
    0L, 0L, 0L),variable3=c(10L, 1L, 1L, 3L, 3L, 10L, 3L),variable4=c(2L, 1L, 1L, 3L, 3L, 8L, 3L)).Names = c("Org_ID", "Market_volume", "Indicator_variable","Var3","Var4"
), class = "data.frame", row.names = c(NA, -7L))
T.Z
  • 65
  • 7
  • _"Using (dplyr), i calculated the % of NA's by market volume by Org_ID via the following function"_ You might want to give some citation, since the code is copied straight from [here](https://stackoverflow.com/questions/47098184/summarizing-data-by-subgroups/47099096?noredirect=1#comment81286844_47099096) – acylam Nov 08 '17 at 02:31
  • UserR, i didn't know that it is possible to put citation. Thanks for 1. helping me with the answer above and 2. for showing me how i can put citation. Your code really helped me! Thank you! – T.Z Nov 08 '17 at 14:42
  • By citation, I mean linking to the original answer, and mentioning that this is a followup question to another question. Please do that in your question body itself instead of posting it as a comment. – acylam Nov 08 '17 at 14:45

1 Answers1

0

You can filter without materializing the aggregated data frame by using group_by %>% filter, and in the filter you can calculate the aggregated condition per group:

df %>% 
    group_by(Org_ID) %>% 
    filter(sum(Market_volume * !Indicator_variable)/sum(Market_volume) > 0.3)

# A tibble: 7 x 5
# Groups:   Org_ID [2]
#  Org_ID Market_volume Indicator_variable  Var3  Var4
#   <int>         <int>              <int> <int> <int>
#1      1           100                  1    10     2
#2      1           200                  0     1     1
#3      1           300                  0     1     1
#4      3           400                  0     3     3
#5      3           200                  0     3     3
#6      3           300                  0    10     8
#7      3           100                  0     3     3
Psidom
  • 209,562
  • 33
  • 339
  • 356
  • Thanks Pdisom! It worked really well. Do you know how to save these results in the data (df) without writing it as csv and rereading it into R? – T.Z Nov 08 '17 at 16:22
  • Do you mean save and load like in [this blog](https://www.fromthebottomoftheheap.net/2012/04/01/saving-and-loading-r-objects/). – Psidom Nov 08 '17 at 16:40
  • 1
    Yes, right now i am doing write.csv (......, "data1.csv") and then load the data1.csv file. But i guess i can just do save() function. I will read the reference! Thanks Psidom! – T.Z Nov 08 '17 at 16:45