2

I have a R DataFrame that has a structure similar to the following:

df <- data.frame(var1 = c(1, 1), var2 = c(0, 2), var3 = c(3, 0), f1 = c('a', 'b'), f2=c('c', 'd') )

So visually the DataFrame would look like

> df
    var1 var2 var3 f1 f2
1    1    0    3    a  c
2    1    2    0    b  d

What I want to do is the following:

(1) Treat the first C=3 columns as counts for three different classes. (C is the number of classes, given as an input variable.) Add a new column called "class".

(2) For each row, duplicate the last two entries of the row according to the count of each class (separately); and append the class number to the new "class" column.

For example, the output for the above dataset would be

> df_updated
  f1 f2 class
1  a  c   1
2  a  c   3
3  a  c   3
4  a  c   3
5  b  d   1
6  b  d   2
7  b  d   2

where row (a c) is duplicated 4 times, 1 time with respect to class 1, and 3 times with respect to class 3; row (b d) is duplicated 3 times, 1 time with respect to class 1 and 2 times with respect to class 2.

I tried looking at previous posts on duplicating rows based on counts (e.g. this link), and I could not figure out how to adapt the solutions there to multiple count columns (and also appending another class column).

Also, my actual dataset has many more rows and classes (say 1000 rows and 20 classes), so ideally I want a solution that is as efficient as possible.

I wonder if anyone can help me on this. Thanks in advance.

Mathreader
  • 23
  • 2

2 Answers2

1

base R

Row order (and row names) notwithstanding:

tmp <- subset(reshape2::melt(df, id.vars = c("f1","f2"), value.name = "class"), class > 0, select = -variable)
tmp[rep(seq_along(tmp$class), times = tmp$class),]
#     f1 f2 class
# 1    a  c     1
# 2    b  d     1
# 4    b  d     2
# 4.1  b  d     2
# 5    a  c     3
# 5.1  a  c     3
# 5.2  a  c     3

dplyr

library(dplyr)
# library(tidyr) # pivot_longer
df %>%
  pivot_longer(-c(f1, f2), values_to = "class") %>%
  dplyr::filter(class > 0) %>%
  select(-name) %>%
  slice(rep(row_number(), times = class))
# # A tibble: 7 x 3
#   f1    f2    class
#   <chr> <chr> <dbl>
# 1 a     c         1
# 2 a     c         3
# 3 a     c         3
# 4 a     c         3
# 5 b     d         1
# 6 b     d         2
# 7 b     d         2
r2evans
  • 141,215
  • 6
  • 77
  • 149
1

Here is a tidyverse option. We can use uncount from tidyr to duplicate the rows according to the count in value (i.e., from the var columns) after pivoting to long format.

library(tidyverse)

df %>% 
  pivot_longer(starts_with("var"), names_to = "class") %>% 
  filter(value != 0) %>% 
  uncount(value) %>% 
  mutate(class = str_extract(class, "\\d+"))

Output

  f1    f2    class
  <chr> <chr> <chr>
1 a     c     1    
2 a     c     3    
3 a     c     3    
4 a     c     3    
5 b     d     1    
6 b     d     2    
7 b     d     2 

Another slight variation is to use expandrows from splitstackshape in conjunction with tidyverse.

library(splitstackshape)

df %>% 
  pivot_longer(starts_with("var"), names_to = "class") %>% 
  filter(value != 0) %>% 
  expandRows("value") %>% 
  mutate(class = str_extract(class, "\\d+"))
AndrewGB
  • 16,126
  • 5
  • 18
  • 49
  • 1
    `uncount`, didn't know that one ... why is such a bespoke one-purpose (perhaps "rarely used") function being defined? I cannot think of one time I've actually wanted that functionality badly enough to formalize it in a package ... – r2evans Jan 28 '22 at 05:20
  • 1
    @r2evans It seems that it might be most useful if you need to get back to the raw data from the counts (perhaps you don't have access to anymore); at least that is what I gather from this post: https://scotinastats.rbind.io/post/the-tale-of-count-von-uncount/ – AndrewGB Jan 28 '22 at 05:31