Reducing dataframe if first 4 columns match

Question

I have a dataframe that looks like this

> head(printing_id_map_unique_frames)
# A tibble: 6 x 5
# Groups:   frame_number [6]
     X1    X2    X3 row_in_frame frame_number
  <dbl> <dbl> <dbl>        <dbl>        <dbl>
1     1     2     3           15            1
2     1     2     3           15            2
3     1     2     3           15            3
4     1     2     3           15            4
5     1     2     3           15            5
6     1     2     3           15            6

As you can see, X1,X2,X3, row_in_frame is identical

However, eventually you get to a

     X1    X2    X3 row_in_frame frame_number
  <dbl> <dbl> <dbl>        <dbl>        <dbl>
1     1     2     3           15           32
2     1     2     3           15           33
3     1     2     3            5           34**
4     1     4     5           15           35
5     1     4     5           15           36

What I would like to do is essentially compute a dataframe that looks like:

        X1    X2    X3        row_in_frame num_duplicates
      <dbl> <dbl> <dbl>        <dbl>        <dbl>
    1     1     2     3           15           33
    2     1     2     3           5            1
...

Essentially, what I want is to "collapse" over identical first 4 columns and count how many rows of that type there are in the "num_duplicates" column.

Is there a nice way to do this in dplyr without a messy for loop that tracks a count and if there is a change.

Below please find a full data structure via dput:

> dput(printing_id_map_unique_frames)
structure(list(X1 = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), X2 = c(2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4
), X3 = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 5, 
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 
5, 5, 5, 5, 5, 5, 5, 5, 5), row_in_frame = c(15, 15, 15, 15, 
15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 
15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 5, 15, 15, 
15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 
15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 5
), frame_number = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 
14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 
30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 
46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 
62, 63, 64, 65, 66, 67, 68)), row.names = c(NA, -68L), class = c("tbl_df", 
"tbl", "data.frame"))

You may need `count(df1, X1, X2, X3, row_in_frame)` – akrun Jul 02 '20 at 21:02 — akrun, Jul 02 '20 at 21:02

akrun · Accepted Answer · 2020-07-02T21:19:27.350

Here is one option with count

library(dplyr) # 1.0.0
df1 %>%
       count(!!! rlang::syms(names(.)[1:4]))

Or specify the unquoted column names

df1 %>%
    count(X1, X2, X3, row_in_frame)

If we don't want to change the order, an option is to convert the first 4 columns to factor with levels specified as the unique values (which is the same as the order of occurrence of values) and then apply the count

df1 %>%
   mutate(across(1:4, ~ factor(.x, levels = unique(.x)))) %>%
   count(!!! rlang::syms(names(.)[1:4])) %>% 
   type.convert(as.is = TRUE)
# A tibble: 4 x 5
#     X1    X2    X3 row_in_frame     n
#  <int> <int> <int>        <int> <int>
#1     1     2     3           15    33
#2     1     2     3            5     1
#3     1     4     5           15    33
#4     1     4     5            5     1

is there a way without changing the order? The count command reshuffles the data. — user1357015, Jul 02 '20 at 21:05
@user1357015. Eventhough, the `sort = FALSE` is by default in `count`, it is reordering. An optin is to change it to `factor` (updated) — akrun, Jul 02 '20 at 21:16

Reducing dataframe if first 4 columns match

1 Answers1