Dataframe Structure Problems and Calculation of new Dataframe

Question

So I got this dataframe (dataframe 1):

# A tibble: 6 x 27
# Groups:   authority_dic, Full.Name [6]
  authority_dic Full.Name     Entity `4_2020` `3_2020` `7_2020` `12_2019` `8_2020` `5_2019` `1_2019` `4_2019` `6_2019` `8_2019` `9_2020` `10_2019` `11_2020` `9_2019`
  <chr>         <chr>         <chr>     <int>    <int>    <int>     <int>    <int>    <int>    <int>    <int>    <int>    <int>    <int>     <int>     <int>    <int>
1 pro           A              SALES     23       22        6         3       15        8        7        2        5        8       13        12         7        6
2 pro           B              WERNE     16       16        7        12        4       11        4       12       13        9        8         9         7       11
3 pro           C              EQT C      5       16        1         2        3        0        6        2        5        1        2         5         3        5
4 pro           D              T-MOB      7        7       16         2        8        6        4       10        4        2        6         3         1        2
5 leader        A              INSPE      2        1        1        15        0        9        1        7        8        4        2         8         2        6
6 pro           F              AXON       4        4        1         1        0       14        0       12        1        0        0         1         1        0
# ... with 10 more variables: `1_2020` <int>, `11_2019` <int>, `10_2020` <int>, `2_2020` <int>, `2_2019` <int>, `3_2019` <int>, `12_2020` <int>, `7_2019` <int>,
#   `6_2020` <int>, `5_2020` <int>

And this dataframe (dataframe 2):

# A tibble: 6 x 25
# Groups:   Full.Name [6]
  Full.Name  `1_2019` `1_2020` `2_2019` `2_2020` `3_2019` `3_2020` `4_2019` `4_2020` `5_2019` `5_2020` `6_2019` `6_2020` `7_2019` `7_2020` `8_2019` `8_2020` `9_2019`
  <chr>         <int>    <int>    <int>    <int>    <int>    <int>    <int>    <int>    <int>    <int>    <int>    <int>    <int>    <int>    <int>    <int>    <int>
1 A               556      220      577      213      628      102      608       58      547      321      429      233      371      370      338      334      409
2 B                 0       48        0        0        0        0       14       45        0        0       83        0        0       16        0        0        0
3 C                 0        0        0        0        0        0        0        0        0        0        0        0        0        0        0        0        0
4 D                 0        0        0        0        0      110        0       63       60      125       23        0       46       32        0       47       18
5 E                73      163       68      113      122        0       70       71      135      124       62      101       28       21       98       11      109
6 F                96      135       57       32      149      173      211      145       61       46      190      308       32      159      210      168        0
# ... with 7 more variables: `9_2020` <int>, `10_2019` <int>, `10_2020` <int>, `11_2019` <int>, `11_2020` <int>, `12_2019` <int>, `12_2020` <int>

Now I want to create another dataframe (dataframe 3) and it should contain all names of the column "full name" and divide all columns values of dataframe 1 "x_20xx" through the matching value of dataframe 2. It's kind of tricky, because unfortunately, both dataframes got the exact same column names for "x_20xx". Maybe we need to change that first in order to calculate the values for the new dataframe.

This is an example of what I want: Obviously with all the "x_20xx" columns.

  Full.Name                          `1_2019_freq` 
  <chr>                                  <int>    
1 A                      (value_1_2019 df1) / (value_1_2019 df2)     
2 B                      (value_1_2019 df1) / (value_1_2019 df2)
3 C                      (value_1_2019 df1) / (value_1_2019 df2)
4 D                      (value_1_2019 df1) / (value_1_2019 df2)
5 E                      (value_1_2019 df1) / (value_1_2019 df2)
6 F                      (value_1_2019 df1) / (value_1_2019 df2)

It's hard to explain, but I hope you get my point. It's a simple calculation, but I am not able to group all "A" and "B" [...] and sum their columns so that there is only one row containing "A" with all the other months but summed values for each month. After that, we only need to create another new dataframe and divide the values through the matching values of df2. Any help is appreciated. Many thanks for considering my request. My dput() output: Dataframe 1:

structure(list(authority_dic = c("pro", "pro", "pro", "pro", 
"leader", "pro", "know", "pro", "pro", "pro"), Full.Name = c("Marc R. Benioff", 
"Derek J. Leathers", "Robert J. McNally", "G. Michael Sievert", 
"Paul J. Sarvadi", "Patrick W. Smith", "John J. Legere", "John J. Legere", 
"Patrick K. Decker", "Michael J. Saylor"), Entity = c("SALESFORCE.COM INC", 
"WERNER ENTERPRISES INC", "EQT CORP (Rapidan Energy)", "T-MOBILE US INC", 
"INSPERITY INC", "AXON ENTERPRISE INC", "T-MOBILE US INC", "T-MOBILE US INC", 
"XYLEM INC", "MICROSTRATEGY INC"), `4_2020` = c(23L, 16L, 5L, 
7L, 2L, 4L, 9L, 4L, 2L, 2L), `3_2020` = c(22L, 16L, 16L, 7L, 
1L, 4L, 8L, 2L, 6L, 0L), `7_2020` = c(6L, 7L, 1L, 16L, 1L, 1L, 
3L, 1L, 3L, 0L), `12_2019` = c(3L, 12L, 2L, 2L, 15L, 1L, 4L, 
3L, 3L, 0L), `8_2020` = c(15L, 4L, 3L, 8L, 0L, 0L, 4L, 2L, 2L, 
0L), `5_2019` = c(8L, 11L, 0L, 6L, 9L, 14L, 7L, 6L, 9L, 0L), 
    `1_2019` = c(7L, 4L, 6L, 4L, 1L, 0L, 13L, 4L, 0L, 2L), `4_2019` = c(2L, 
    12L, 2L, 10L, 7L, 12L, 2L, 13L, 2L, 0L), `6_2019` = c(5L, 
    13L, 5L, 4L, 8L, 1L, 6L, 4L, 6L, 0L), `8_2019` = c(8L, 9L, 
    1L, 2L, 4L, 0L, 9L, 8L, 13L, 0L), `9_2020` = c(13L, 8L, 2L, 
    6L, 2L, 0L, 5L, 2L, 0L, 3L), `10_2019` = c(12L, 9L, 5L, 3L, 
    8L, 1L, 10L, 7L, 9L, 0L), `11_2020` = c(7L, 7L, 3L, 1L, 2L, 
    1L, 4L, 0L, 3L, 12L), `9_2019` = c(6L, 11L, 5L, 2L, 6L, 0L, 
    11L, 6L, 6L, 0L), `1_2020` = c(6L, 9L, 1L, 4L, 11L, 5L, 4L, 
    5L, 3L, 0L), `11_2019` = c(10L, 11L, 0L, 5L, 11L, 1L, 6L, 
    4L, 4L, 0L), `10_2020` = c(10L, 7L, 3L, 5L, 3L, 0L, 5L, 1L, 
    1L, 7L), `2_2020` = c(7L, 4L, 3L, 4L, 4L, 0L, 10L, 7L, 5L, 
    0L), `2_2019` = c(7L, 5L, 5L, 1L, 6L, 0L, 6L, 3L, 0L, 0L), 
    `3_2019` = c(4L, 9L, 1L, 6L, 6L, 3L, 6L, 10L, 4L, 0L), `12_2020` = c(3L, 
    10L, 3L, 5L, 0L, 2L, 2L, 1L, 3L, 10L), `7_2019` = c(7L, 9L, 
    4L, 2L, 6L, 0L, 3L, 8L, 1L, 0L), `6_2020` = c(3L, 3L, 1L, 
    3L, 1L, 2L, 2L, 2L, 8L, 0L), `5_2020` = c(2L, 7L, 0L, 7L, 
    4L, 2L, 3L, 4L, 1L, 0L)), class = c("grouped_df", "tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -10L), groups = structure(list(
    authority_dic = c("know", "leader", "pro", "pro", "pro", 
    "pro", "pro", "pro", "pro", "pro"), Full.Name = c("John J. Legere", 
    "Paul J. Sarvadi", "Derek J. Leathers", "G. Michael Sievert", 
    "John J. Legere", "Marc R. Benioff", "Michael J. Saylor", 
    "Patrick K. Decker", "Patrick W. Smith", "Robert J. McNally"
    ), .rows = structure(list(7L, 5L, 2L, 4L, 8L, 1L, 10L, 9L, 
        6L, 3L), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L), .drop = TRUE))

Dataframe 2:

structure(list(Full.Name = c("A. Patrick Beharelle", "Aaron P. Graft", 
"Aaron P. Jagdfeld", "Adam H. Schechter", "Adam P. Symson", "Adena T. Friedman", 
"John J. Legere", "Albert Bourla, D.V.M., DVM, Ph.D.", 
"Andres Ricardo Gluski Weilert", "Andrew Anagnost"), `1_2019` = c(556L, 
0L, 0L, 0L, 73L, 96L, 0L, 0L, 40L, 222L), `1_2020` = c(220L, 
48L, 0L, 0L, 163L, 135L, 0L, 209L, 0L, 110L), `2_2019` = c(577L, 
0L, 0L, 0L, 68L, 57L, 0L, 0L, 98L, 215L), `2_2020` = c(213L, 
0L, 0L, 0L, 113L, 32L, 0L, 66L, 0L, 133L), `3_2019` = c(628L, 
0L, 0L, 0L, 122L, 149L, 0L, 0L, 8L, 278L), `3_2020` = c(102L, 
0L, 0L, 110L, 0L, 173L, 0L, 243L, 0L, 120L), `4_2019` = c(608L, 
14L, 0L, 0L, 70L, 211L, 0L, 0L, 18L, 200L), `4_2020` = c(58L, 
45L, 0L, 63L, 71L, 145L, 0L, 315L, 0L, 111L), `5_2019` = c(547L, 
0L, 0L, 60L, 135L, 61L, 0L, 0L, 68L, 101L), `5_2020` = c(321L, 
0L, 0L, 125L, 124L, 46L, 0L, 100L, 0L, 166L), `6_2019` = c(429L, 
83L, 0L, 23L, 62L, 190L, 0L, 0L, 150L, 242L), `6_2020` = c(233L, 
0L, 0L, 0L, 101L, 308L, 112L, 128L, 0L, 201L), `7_2019` = c(371L, 
0L, 0L, 46L, 28L, 32L, 0L, 111L, 0L, 333L), `7_2020` = c(370L, 
16L, 0L, 32L, 21L, 159L, 0L, 199L, 43L, 15L), `8_2019` = c(338L, 
0L, 0L, 0L, 98L, 210L, 0L, 56L, 0L, 212L), `8_2020` = c(334L, 
0L, 0L, 47L, 11L, 168L, 0L, 207L, 0L, 40L), `9_2019` = c(409L, 
0L, 0L, 18L, 109L, 0L, 0L, 73L, 0L, 387L), `9_2020` = c(406L, 
91L, 5L, 0L, 40L, 164L, 23L, 183L, 65L, 28L), `10_2019` = c(330L, 
35L, 0L, 0L, 63L, 119L, 0L, 192L, 34L, 88L), `10_2020` = c(331L, 
0L, 0L, 0L, 25L, 185L, 71L, 451L, 43L, 29L), `11_2019` = c(448L, 
33L, 0L, 0L, 131L, 108L, 0L, 41L, 35L, 196L), `11_2020` = c(559L, 
0L, 0L, 44L, 83L, 86L, 49L, 803L, 0L, 132L), `12_2019` = c(300L, 
0L, 4L, 0L, 72L, 167L, 0L, 155L, 68L, 93L), `12_2020` = c(122L, 
0L, 0L, 0L, 0L, 54L, 0L, 790L, 0L, 1L)), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -10L), groups = structure(list(
    Full.Name = c("A. Patrick Beharelle", "Aaron P. Graft", "Aaron P. Jagdfeld", 
    "Adam H. Schechter", "Adam P. Symson", "Adena T. Friedman", 
    "Alan David Schnitzer", "Albert Bourla, D.V.M., DVM, Ph.D.", 
    "Andres Ricardo Gluski Weilert", "Andrew Anagnost"), .rows = structure(list(
        1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -10L), .drop = TRUE))

Take "John J. Legere" as an example.

So, John J. Legere shows up two times in `df1`, with different values in `"authority_dic"`: once as "know", once as "pro". They only show up once in `df2`. Do you want to divide both `df1` rows by the single `df2` row? Or something else? Do you need the `authority_dic` column in the results? — Gregor Thomas, May 19 '22 at 17:46
And, what do you want to happen to names in `df1` that don't exist in `df2`? Should they be included in the results but not divided? `NA` in the results? Omitted from the results? — Gregor Thomas, May 19 '22 at 17:50
Thank you for your comment! I want to sum up all names and add their values to a total. The `authority_dic` column isn't needed in the new df and all names of `df1` are also in `df2`. — Yeezus, May 19 '22 at 18:19
Here's the FAQ on [summing by group](https://stackoverflow.com/q/1660124/903061) - I'd recommend the `dplyr` answer. — Gregor Thomas, May 19 '22 at 20:12
To answer your other question that you just deleted, I think you could use `bind_rows( df %>% group_by(group = authority_dic) %>% summarize(across(`2019`:`2020`, sum)), df %>% group_by(group = Full.Name) %>% summarize(across(`2019`:`2020`, sum)) )` — Jon Spring, Jun 01 '22 at 18:13

langtang · Accepted Answer · 2022-05-19T19:38:20.080

As usual, working with wide data is hard, but working long is easy:

inner_join(
  pivot_longer(df1, cols=4:27, names_to="period") %>% group_by(Full.Name, period) %>% summarize(value=sum(value,na.rm=T), .groups="drop"),
  pivot_longer(df2, cols=2:25, names_to="period") %>% group_by(Full.Name, period) %>% summarize(value=sum(value,na.rm=T), .groups="drop"), 
  by=c("Full.Name", "period")
) %>% 
  mutate(result = value.x/value.y) %>% 
  pivot_wider(id_cols = Full.Name, names_from="period", values_from="result")

Change to left_join() if you want to retain names from df1 that don't appear in df2

Output:

# A tibble: 1 × 25
  Full.Name      `1_2019` `1_2020` `10_2019` `10_2020` `11_2019` `11_2020` `12_2019` `12_2020` `2_2019` `2_2020` `3_2019` `3_2020` `4_2019` `4_2020` `5_2019` `5_2020`
  <chr>             <dbl>    <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 John J. Legere      Inf      Inf       Inf    0.0845       Inf    0.0816       Inf       Inf      Inf      Inf      Inf      Inf      Inf      Inf      Inf      Inf
# … with 8 more variables: `6_2019` <dbl>, `6_2020` <dbl>, `7_2019` <dbl>, `7_2020` <dbl>, `8_2019` <dbl>, `8_2020` <dbl>, `9_2019` <dbl>, `9_2020` <dbl>

Note that the group_by/summarize after pivoting df2 can be dropped if the there is only one row per Full.Name in that frame. Also, notice that the summing using na.rm=T, which is probably what you want, but also results in zeros in value.y, resulting in lots of dividing by zero.

Thank you for your work and time! This works for the calculation part, but it doesn't sum up duplicate names to a single row for each unique name in `Full.Name` and `authority_dic` isn't needed anymore, so it could be dropped. — Yeezus, May 19 '22 at 18:28
okay, if you want to sum across multiple rows, you can simply add a `group_by()` and `summarize()` after the pivot_wider, but before the merge. See my edit — langtang, May 19 '22 at 19:35

score 0 · Answer 2 · answered May 19 '22 at 17:55

I would use base R for this. This approach only includes names in the result that have matches in both data frames. In the sample data, John Legere is the only matching name, and all the corresponding values in df2 are 0, and R outputs Inf when we divide by 0.

If there are issues and more help is needed, I'd suggest updating the sample data so that there are more matches and that there are some non-0 values to divide by.

cols_to_divide = setdiff(intersect(names(df1), names(df2)), "Full.Name")
matching_names = intersect(df1$Full.Name, df2$Full.Name)
# initialize the results with the numerator
df3 = df1[df1$Full.Name %in% matching_names, c("Full.Name", "authority_dic", cols_to_divide)]
# find the right row numbers for the denominator
rows_to_divide = match(df3$Full.Name, df2$Full.Name)
# divide
df3[cols_to_divide] = df3[cols_to_divide] / df2[rows_to_divide, cols_to_divide]

df3
# # A tibble: 2 × 26
# # Groups:   authority_dic, Full.Name [2]
#   Full.Name      authority_dic `4_2020` `3_2020` `7_2020` `12_2019` `8_2020` `5_2019` `1_2019`
#   <chr>          <chr>            <dbl>    <dbl>    <dbl>     <dbl>    <dbl>    <dbl>    <dbl>
# 1 John J. Legere know               Inf      Inf      Inf       Inf      Inf      Inf      Inf
# 2 John J. Legere pro                Inf      Inf      Inf       Inf      Inf      Inf      Inf
# # … with 17 more variables: 4_2019 <dbl>, 6_2019 <dbl>, 8_2019 <dbl>, 9_2020 <dbl>,
# #   10_2019 <dbl>, 11_2020 <dbl>, 9_2019 <dbl>, 1_2020 <dbl>, 11_2019 <dbl>, 10_2020 <dbl>,
# #   2_2020 <dbl>, 2_2019 <dbl>, 3_2019 <dbl>, 12_2020 <dbl>, 7_2019 <dbl>, 6_2020 <dbl>,
# #   5_2020 <dbl>

Dataframe Structure Problems and Calculation of new Dataframe

2 Answers2