3

I have a dataframe of names and years, with a dummy variable for whether the name occurred in a year or not.

I'm trying to create a dataframe which tell me

    1. the total number names which appeared in that year, and
    1. the number of those which appeared in that year but not in the year before.

In the below example, in 2017 there is only one person occurring (Terry) and nothing for the previous year, so both total and new would be 1. In 2018 three people occur but only two are new as Terry occurred in the previous year. If somebody appeared in 2017 and 2019 but not in 2018, they should be classed as new in 2019.

EXAMPLE

   Name x2017 x2018 x2019
1 Terry     1     1     0
2   Sam     0     0     1
3   Nic     0     1     1
4 Sarah     0     1     1

CODE

data.frame(
  Name = c("Terry", "Sam", "Nic", "Sarah"), 
  x2017 = c(1, 0, 0, 0), 
  x2018 = c(1, 0, 1, 1), 
  x2019 = c(0, 1, 1, 1)
  )

OUTPUT I'M TRYING TO CREATE

  Year Total New
1 2017     1   1
2 2018     3   2
3 2019     3   1

I've tried filtering and using row sums, but I feel like there's a function which I don't know of that can do this.

Thanks!

AnilGoyal
  • 25,297
  • 4
  • 27
  • 45
Sean
  • 99
  • 6

4 Answers4

3

The mutate(new = as.numeric(values == 1 & lag(values) == 0), new = ifelse(is.na(new), values, new)) %>%part is from stefan (credits to him, thank you stefan). The difference is parse_number

library(tidyverse)
df %>% 
  pivot_longer(
    cols = -Name,
    names_to = "Year", 
    values_to = "values"
  ) %>% 
  mutate(Year = parse_number(Year)) %>% 
  mutate(new = as.numeric(values == 1 & lag(values) == 0),
         new = ifelse(is.na(new), values, new)) %>% 
  group_by(Year) %>% 
  summarise(Total = sum(values), New = sum(new))

output:

   Year Total   New
* <dbl> <dbl> <dbl>
1  2017     1     1
2  2018     3     2
3  2019     3     1
TarJae
  • 72,363
  • 6
  • 19
  • 66
  • Nice solution. At first I suspected why I can't think of a more concise solution. Apparently there wasn't any about this question lol. – Anoushiravan R May 30 '21 at 21:35
3

Updated-02 I am so sorry as I had to revise my solution, cause I realized that only when a name has occurred in the previous year it is not considered new so you can use also use this for your sample data and the one presented below the page:

library(dplyr)
library(purrr)


df %>% 
  summarise(across(2:4, ~ sum(.x))) %>%
  bind_cols() %>% 
  pivot_longer(everything(), names_to = "Year", values_to = "Total", 
               names_prefix = "x") %>%
  left_join(df %>% select(2:4) %>% pmap_dfr(~ {x <- c(...); x - lag(x, default = 0)}) %>% 
              summarise(across(everything(), ~ sum(.x == 1))) %>% 
              pivot_longer(everything(), names_to = "Year", values_to = "New", 
                           names_prefix = "x"), 
            by = "Year")
 

# A tibble: 3 x 3
  Year  Total   New
  <chr> <dbl> <dbl>
1 2017      1     1
2 2018      3     2
3 2019      3     1
Anoushiravan R
  • 21,622
  • 3
  • 18
  • 41
  • 1
    I was happy to see your solution. This gave me now the power to learn `purrr`. Thank you Anoushiravan. Maybe you are interested in this question of mine – TarJae May 30 '21 at 21:38
  • 1
    Oh yes I saw it today but was a bit late as some of my prospective solutions had been already posted so I just voted and sit out. Bravo by the way it was a very good question. – Anoushiravan R May 30 '21 at 21:46
  • 1
    These two questions might also be of interest to you: https://stackoverflow.com/questions/66833328/using-pmap-to-iterate-over-rows-of-a-tibble https://stackoverflow.com/questions/67049561/using-pmap-with-c-part-2 – Anoushiravan R May 30 '21 at 21:49
2

Maybe this is what you are looking for:

  1. Reshape to long format using e.g. tidy::pivot_longer
  2. Group by Name and make use of dplyr::lag to add an indicator whether a person is new
  3. Summarise by year
d <- data.frame(
  Name = c("Terry", "Sam", "Nic", "Sarah"), 
  x2017 = c(1, 0, 0, 0), 
  x2018 = c(1, 0, 1, 1), 
  x2019 = c(0, 1, 1, 1)
) 

library(dplyr)
library(tidyr)

d %>% 
  tidyr::pivot_longer(-Name, names_to = "year") %>% 
  mutate(year = gsub("^x", "", year)) %>%
  group_by(Name) %>% 
  mutate(new = as.numeric(value == 1 & lag(value) == 0),
         new = ifelse(is.na(new), value, new)) %>% 
  ungroup() %>% 
  group_by(year) %>% 
  summarise(total = sum(value), new = sum(new))
#> # A tibble: 3 x 3
#>   year  total   new
#>   <chr> <dbl> <dbl>
#> 1 2017      1     1
#> 2 2018      3     2
#> 3 2019      3     1
stefan
  • 90,330
  • 6
  • 25
  • 51
  • this won't give you correct results in many cases. It checks whether a record is available in just previous year and not in `any of the previous years` Try this on `structure(list(Name = c("Terry", "Sam", "Nic", "Sarah"), x2017 = c(1, 0, 0, 1), x2018 = c(1, 0, 1, 0), x2019 = c(0, 1, 1, 1)), row.names = c(NA, -4L), class = "data.frame")` – AnilGoyal May 31 '21 at 10:43
  • @AnilGoyal. The OP defined new as *"If somebody appeared in 2017 and 2019 but not in 2018, they should be classed as new in 2019."*. That's why I only checked for the previous year. – stefan May 31 '21 at 11:23
  • Ok. Thanks for clarification, I will revise my answer shortly. :) – AnilGoyal May 31 '21 at 11:31
2

case-I When records have to be checked in one previous row only.

df %>%
  pivot_longer(!Name, names_to = 'Year', names_prefix = 'x') %>%
  group_by(Year) %>%
  summarise(total = sum(value),
            new = list(Name[value == 1]), .groups = 'drop') %>%
  mutate(new = map2_int(new, lag(new), ~ sum(!(.x %in% .y))))

# A tibble: 3 x 3
  Year  total   new
  <chr> <dbl> <int>
1 2017      1     1
2 2018      3     2
3 2019      3     1

Case-II when records have to be looked into all previous rows. Using both map_* and accumulate. The strategy adopted-

  • pivot_longer first. drop x from years directly here using names_prefix argument
  • group_by on year then
  • calculate total values n() and names for that year in a list
  • mutate new using map2_int with first argument as that list only, and second argument as accumulated and lagged list .
  • map2_int thus calculates the total number of TRUE in that row.

library(tidyverse)
df %>%
  pivot_longer(!Name, names_to = 'Year', names_prefix = 'x') %>%
  group_by(Year) %>%
  summarise(total = sum(value),
         new = list(Name[value == 1]), .groups = 'drop') %>%
  mutate(new = map2_int(new, lag(accumulate(new, union, .init = first(new))[-1]), ~ sum(!(.x %in% .y))))

#> # A tibble: 3 x 3
#>   Year  total   new
#>   <chr> <int> <int>
#> 1 2017      1     1
#> 2 2018      3     2
#> 3 2019      3     1

AnilGoyal
  • 25,297
  • 4
  • 27
  • 45