1

I would like to merge rows in the my data frame by unique emails, but I do not want to lose any data. To do this I would like the function to combine rows with the same email address. Along with this, if there happens to be overlapping data for an email address that I am trying to combine into one, I want the data from the row with less cells filled in to be added into a new column. Please as questions because I know that I am not explaining this very clearly.

Below is an example of what I am looking for the function to do (data made up).

First Name Last Name Email Phone Address Shoe Size
John Schmitt jschmitt@gmail.com 914-392-1840 address 1 4
Paul Johnson pjohnson@gmail.com 274-184-3653 address 2 2
Brad Arnold barnold@gmail.com 157-135-3175 address 3 5
John Schmitt jschmitt@gmail.com 914-392-1840 6

This sheet should become:

First Name Last Name Email Phone Address Shoe Size Shoe Size 2
John Schmitt jschmitt@gmail.com 914-392-1840 address 1 4 6
Paul Johnson pjohnson@gmail.com 274-184-3653 address 2 2
Brad Arnold barnold@gmail.com 157-135-3175 address 3 5

Basically, the phone number connected to forjschmitt@gmail.com stays in the "Phone" column because it is the same for both rows. Even though the rows are not the same for the address, because the bottom row is blank, it stays the same. Finally, a new column is created for Shoe Size, because there are two differing values for the rows that we are merging. The way that the function should pick which Shoe size to put in Shoe Size 2 is by looking at the number of cells in each row. The shoe size in the row with more cells filled goes in the original Shoe Size column. The shoe size in the row with less cells filled goes in the new Shoe Size 2 column.

Feel free to ask any questions or make any suggestions about how I could do something of this nature in an easier way. I also haven't figured out what to do if the two rows with conflicting data have the same number of cells filled...

Martin Gal
  • 16,640
  • 5
  • 21
  • 39
  • And if anyone can make my sheets into sheets on here so you can understand better! – Elijah Appelson Aug 23 '21 at 19:44
  • Please provide a [reproducible minimal example](https://stackoverflow.com/q/5963269/8107362). Especially, provide some sample data, e.g. with `dput()` and your expected result. – mnist Aug 23 '21 at 21:18

1 Answers1

1

Update: tidyverse only solution with the note of Martin Gal using chop

df %>% 
  select(-Address) %>% 
  chop(`Shoe Size`) %>% 
  unnest_wider(`Shoe Size`) %>% 
  rename(`Shoe Size` = ...1, `Shoe Size 2` = ...2) %>% 
  left_join(df, by= "Shoe Size") %>% 
  select(-contains(".y")) %>%
  rename_with(~str_remove(., '.x')) %>% 
  relocate(Address, .after = Phone) %>% 
  arrange(Address)

First answer:

Here is a way how we could achieve the result. The logic:

  1. remove Address and assgin to new df1
  2. use aggregate to basically combine the duplicate parts of rows and aggregate the not duplicate part (here: Shoe Size)
  3. Use unnest_wider to unnest the list column
  4. rename
  5. left_join with df and clean with select, rename_with
  6. relocate and arrange
library(dplyr)
library(tidyr)

# base R remove column Address and assign to df1
df1 <- df[,-5]

# aggregate Shoe Size (I don´t know how to do this in dplyr, therefore base R)
df1 <- aggregate(df1[5], df1[-5], unique)

# now with tidyverse(dpylr, tidyr)
df1 %>% 
    unnest_wider(`Shoe Size`) %>% 
    rename(`Shoe Size` = ...1, `Shoe Size 2` = ...2) %>% 
    left_join(df, by= "Shoe Size") %>% 
    select(-contains(".y")) %>%
    rename_with(~str_remove(., '.x')) %>% 
    relocate(Address, .after = Phone) %>% 
    arrange(Address)
# A tibble: 3 x 7
  `First Name` `Last Name` Email              Phone        Address   `Shoe Size` `Shoe Size 2`
  <chr>        <chr>       <chr>              <chr>        <chr>           <dbl>         <dbl>
1 John         Schmitt     jschmitt@gmail.com 914-392-1840 address 1           4             6
2 Paul         Johnson     pjohnson@gmail.com 274-184-3653 address 2           2            NA
3 Brad         Arnold      barnold@gmail.com  157-135-3175 address 3           5            NA

data:

structure(list(`First Name` = c("John", "Paul", "Brad", "John"
), `Last Name` = c("Schmitt", "Johnson", "Arnold", "Schmitt"), 
    Email = c("jschmitt@gmail.com", "pjohnson@gmail.com", "barnold@gmail.com", 
    "jschmitt@gmail.com"), Phone = c("914-392-1840", "274-184-3653", 
    "157-135-3175", "914-392-1840"), Address = c("address 1", 
    "address 2", "address 3", NA), `Shoe Size` = c(4, 2, 5, 6
    )), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
TarJae
  • 72,363
  • 6
  • 19
  • 66