So let's say we have this df:

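library(dplyr)  # %>%, mutate(), filter()
library(tidyr)  # nest()

a = c(rep(1,5),rep(0,5),rep(1,5),rep(0,5))
b = c(rep(4,5),rep(3,5),rep(2,5),rep(1,5))
c = c(rep("w",5),rep("x",5),rep("y",5),rep("z",5))
df = data.frame(a,b,c)
# one row per value of c; a and b are packed into the list column "data"
df = df %>% 
  nest(data=c(a,b))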

I want to use values computed from inside the nested "data" column to operate on the entire data frame: for example, use filter() to drop rows where the sum of "a" inside the nested "data" equals 0, or arrange the rows of the data frame by the max() of "b". How can I do this?

I came up with a pretty dumb way of doing this, but I'm not happy with it, as it isn't really applicable to the larger datasets I'm working with:

sum_column = function(df){

  df = df %>% 
    summarize(value=sum(a))
  return(df[[1]][1])
}

# so make a new column with the sum of a, and THEN filter by that
df = df %>% 
  mutate(sum_of_a = map(data, ~sum_column(.x))) %>% 
  filter(!sum_of_a==0)
         
Phil

1 Answer

map() always returns a list, so sum_of_a ends up as a list column. Perhaps you want map_dbl(), which returns a plain numeric vector?

library(dplyr)
library(purrr)
df %>% 
  mutate(sum_of_a = map_dbl(data, ~ sum(.x$a))) %>% 
  filter(!sum_of_a == 0)
# # A tibble: 2 × 3
#   c     data             sum_of_a
#   <chr> <list>              <dbl>
# 1 w     <tibble [5 × 2]>        5
# 2 y     <tibble [5 × 2]>        5

or more directly (in case you no longer need sum_of_a):

df %>% 
  filter(abs(map_dbl(data, ~ sum(.x$a))) > 0)
# # A tibble: 2 × 2
#   c     data            
#   <chr> <list>          
# 1 w     <tibble [5 × 2]>
# 2 y     <tibble [5 × 2]>

(The only reason I changed from ! . == 0 to abs(.) > 0 is floating-point tests of equality: I don't want to assume the precision and scale of the numbers you're actually using. Cf. Why are these numbers not equal?, https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f.)
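
To make the floating-point concern concrete, here is a minimal sketch. The name `tol` and its value are assumptions for illustration, not something from the question. A threshold like this only makes sense if it suits the scale of your real data.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Classic case: the two sides differ in the last few bits.
x <- 0.1 + 0.2
exact <- x == 0.3                    # FALSE: exact comparison fails
near  <- isTRUE(all.equal(x, 0.3))   # TRUE: comparison with a tolerance

# Same nested data as in the question.
df <- data.frame(
  a = rep(c(1, 0, 1, 0), each = 5),
  b = rep(c(4, 3, 2, 1), each = 5),
  c = rep(c("w", "x", "y", "z"), each = 5)
) %>%
  nest(data = c(a, b))

# Hypothetical threshold (an assumption); choose one that fits your data.
tol <- 1e-8

# Keep groups whose sum of a is further from zero than the threshold.
kept <- df %>%
  filter(abs(map_dbl(data, ~ sum(.x$a))) > tol)
```

With the example data the sums are exactly 5 or 0, so `kept` holds the same two groups (w and y) as the versions above. The threshold only matters when the sums come from non-integer arithmetic.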

r2evans
  • thank you! that is a much more efficient solution. how would you order the rows by max(b)? – Alex Markov Feb 09 '23 at 16:12
  • add `arrange(map_dbl(data, ~ max(.$b)))` or explicitly bring it out to its own column with `mutate(maxb = map_dbl(data, ~ max(.$b)))` – r2evans Feb 09 '23 at 16:21
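
Spelling out the ordering suggestion from the comments as a self-contained sketch: it rebuilds the question's data, and the names `sorted_asc`, `sorted_desc` and `maxb` are only illustrative.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Same nested data as in the question.
df <- data.frame(
  a = rep(c(1, 0, 1, 0), each = 5),
  b = rep(c(4, 3, 2, 1), each = 5),
  c = rep(c("w", "x", "y", "z"), each = 5)
) %>%
  nest(data = c(a, b))

# Order rows by the max of b inside each nested tibble (ascending).
sorted_asc <- df %>%
  arrange(map_dbl(data, ~ max(.x$b)))

# Or keep the value as its own column and sort descending on it.
sorted_desc <- df %>%
  mutate(maxb = map_dbl(data, ~ max(.x$b))) %>%
  arrange(desc(maxb))

sorted_asc$c   # "z" "y" "x" "w"
sorted_desc$c  # "w" "x" "y" "z"
```

Keeping `maxb` as a column is handy if you also want to filter on it or inspect it later. The inline `arrange()` form avoids carrying an extra column around.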