10

My data looks like this:

data("Titanic")
df <- as.data.frame(Titanic)

How can I de-aggregate or reverse-summarise count/freq and expand the data set back to it's original non-count observation state?

For instance, I want 3rd, Male, Child, No repeated 35 times and 1st, Female, Adult, Yes repeated 140 times, etc, etc, in the dataframe.

Thanks in advance.

emehex
  • 9,874
  • 10
  • 54
  • 100

4 Answers4

7

You can do this with list columns and a few dplyr/tidyr/purrr verbs. It's not as compact as other base R solutions may be, but for me it's easier to understand how things fit together and it works in a larger tidyverse pipeflow.

As a check first-off, we're expecting 2,201 rows in the final data frame:

library(dplyr)
library(tidyr)
library(purrr)

sum(df$Freq)
#> [1] 2201

Converting this to a tibble makes it easier to see and work with the list column. I'm using purrr::map to move along the Freq column, creating a vector of a dummy marker the length of the Freq value. In this case, that marker is just "1"; it could instead be TRUE or anything else. The point it just that it will create a vector of the length Freq.

df %>%
  as_tibble() %>%
  mutate(obs = map(Freq, ~rep_len(1, .x)))
#> # A tibble: 32 x 6
#>    Class Sex    Age   Survived  Freq obs        
#>    <fct> <fct>  <fct> <fct>    <dbl> <list>     
#>  1 1st   Male   Child No           0 <dbl [0]>  
#>  2 2nd   Male   Child No           0 <dbl [0]>  
#>  3 3rd   Male   Child No          35 <dbl [35]> 
#>  4 Crew  Male   Child No           0 <dbl [0]>  
#>  5 1st   Female Child No           0 <dbl [0]>  
#>  6 2nd   Female Child No           0 <dbl [0]>  
#>  7 3rd   Female Child No          17 <dbl [17]> 
#>  8 Crew  Female Child No           0 <dbl [0]>  
#>  9 1st   Male   Adult No         118 <dbl [118]>
#> 10 2nd   Male   Adult No         154 <dbl [154]>
#> # … with 22 more rows

Then tidyr::unnest creates a row for each element in that dummy vector. After that, I drop those last 2 columns to just have the important categories of class, sex, age, and survival.

df %>%
  as_tibble() %>%
  mutate(obs = map(Freq, ~rep_len(1, .x))) %>%
  unnest() %>%
  select(-Freq, -obs)
#> # A tibble: 2,201 x 4
#>    Class Sex   Age   Survived
#>    <fct> <fct> <fct> <fct>   
#>  1 3rd   Male  Child No      
#>  2 3rd   Male  Child No      
#>  3 3rd   Male  Child No      
#>  4 3rd   Male  Child No      
#>  5 3rd   Male  Child No      
#>  6 3rd   Male  Child No      
#>  7 3rd   Male  Child No      
#>  8 3rd   Male  Child No      
#>  9 3rd   Male  Child No      
#> 10 3rd   Male  Child No      
#> # … with 2,191 more rows

In the end, it is in fact a data frame of 2,201 rows.

camille
  • 16,432
  • 18
  • 38
  • 60
5

Without packages we can repeat each row according to the frequencies given:

df2 <- df[rep(1:nrow(df), df[,5]),-5]
Pierre L
  • 28,203
  • 6
  • 47
  • 69
4

You can do this using the function untable in reshape.

data("Titanic")
df <- as.data.frame(Titanic)

library(reshape)
newDf = untable(df[,1:4], num = df[,5])
David
  • 11,245
  • 3
  • 41
  • 46
3

Chiming in with another tidyr approach, using tidyr::uncount.

library(tidyverse)

original <- tibble(x = c(1,1,1,2,2,2,4,4,4))
aggregated <- original %>% count(x)
deaggregated <- aggregated %>% uncount(weights = n)
jsavn
  • 701
  • 1
  • 8
  • 17