Merging rows based on unique emails with overlapping data

Question

I would like to merge rows in the my data frame by unique emails, but I do not want to lose any data. To do this I would like the function to combine rows with the same email address. Along with this, if there happens to be overlapping data for an email address that I am trying to combine into one, I want the data from the row with less cells filled in to be added into a new column. Please as questions because I know that I am not explaining this very clearly.

Below is an example of what I am looking for the function to do (data made up).

First Name	Last Name	Email	Phone	Address	Shoe Size
John	Schmitt	jschmitt@gmail.com	914-392-1840	address 1	4
Paul	Johnson	pjohnson@gmail.com	274-184-3653	address 2	2
Brad	Arnold	barnold@gmail.com	157-135-3175	address 3	5
John	Schmitt	jschmitt@gmail.com	914-392-1840		6

This sheet should become:

First Name	Last Name	Email	Phone	Address	Shoe Size	Shoe Size 2
John	Schmitt	jschmitt@gmail.com	914-392-1840	address 1	4	6
Paul	Johnson	pjohnson@gmail.com	274-184-3653	address 2	2
Brad	Arnold	barnold@gmail.com	157-135-3175	address 3	5

Basically, the phone number connected to forjschmitt@gmail.com stays in the "Phone" column because it is the same for both rows. Even though the rows are not the same for the address, because the bottom row is blank, it stays the same. Finally, a new column is created for Shoe Size, because there are two differing values for the rows that we are merging. The way that the function should pick which Shoe size to put in Shoe Size 2 is by looking at the number of cells in each row. The shoe size in the row with more cells filled goes in the original Shoe Size column. The shoe size in the row with less cells filled goes in the new Shoe Size 2 column.

Feel free to ask any questions or make any suggestions about how I could do something of this nature in an easier way. I also haven't figured out what to do if the two rows with conflicting data have the same number of cells filled...

And if anyone can make my sheets into sheets on here so you can understand better! — Elijah Appelson, Aug 23 '21 at 19:44
Please provide a [reproducible minimal example](https://stackoverflow.com/q/5963269/8107362). Especially, provide some sample data, e.g. with `dput()` and your expected result. — mnist, Aug 23 '21 at 21:18

TarJae · Answer 1 · 2021-08-24T09:03:43.280

Update: tidyverse only solution with the note of Martin Gal using chop

df %>% 
  select(-Address) %>% 
  chop(`Shoe Size`) %>% 
  unnest_wider(`Shoe Size`) %>% 
  rename(`Shoe Size` = ...1, `Shoe Size 2` = ...2) %>% 
  left_join(df, by= "Shoe Size") %>% 
  select(-contains(".y")) %>%
  rename_with(~str_remove(., '.x')) %>% 
  relocate(Address, .after = Phone) %>% 
  arrange(Address)

First answer:

Here is a way how we could achieve the result. The logic:

remove Address and assgin to new df1
use aggregate to basically combine the duplicate parts of rows and aggregate the not duplicate part (here: Shoe Size)
Use unnest_wider to unnest the list column
rename
left_join with df and clean with select, rename_with
relocate and arrange

library(dplyr)
library(tidyr)

# base R remove column Address and assign to df1
df1 <- df[,-5]

# aggregate Shoe Size (I don´t know how to do this in dplyr, therefore base R)
df1 <- aggregate(df1[5], df1[-5], unique)

# now with tidyverse(dpylr, tidyr)
df1 %>% 
    unnest_wider(`Shoe Size`) %>% 
    rename(`Shoe Size` = ...1, `Shoe Size 2` = ...2) %>% 
    left_join(df, by= "Shoe Size") %>% 
    select(-contains(".y")) %>%
    rename_with(~str_remove(., '.x')) %>% 
    relocate(Address, .after = Phone) %>% 
    arrange(Address)

# A tibble: 3 x 7
  `First Name` `Last Name` Email              Phone        Address   `Shoe Size` `Shoe Size 2`
  <chr>        <chr>       <chr>              <chr>        <chr>           <dbl>         <dbl>
1 John         Schmitt     jschmitt@gmail.com 914-392-1840 address 1           4             6
2 Paul         Johnson     pjohnson@gmail.com 274-184-3653 address 2           2            NA
3 Brad         Arnold      barnold@gmail.com  157-135-3175 address 3           5            NA

data:

structure(list(`First Name` = c("John", "Paul", "Brad", "John"
), `Last Name` = c("Schmitt", "Johnson", "Arnold", "Schmitt"), 
    Email = c("jschmitt@gmail.com", "pjohnson@gmail.com", "barnold@gmail.com", 
    "jschmitt@gmail.com"), Phone = c("914-392-1840", "274-184-3653", 
    "157-135-3175", "914-392-1840"), Address = c("address 1", 
    "address 2", "address 3", NA), `Shoe Size` = c(4, 2, 5, 6
    )), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))

I think you are looking for something like `df1 %>% chop(`Shoe Size`) %>% unnest_wider(`Shoe Size`)` to avoid the `aggregate` function. — Martin Gal, Aug 24 '21 at 08:43
@Martin Gal. That's perfect. Thank you for this pointer. !!! — TarJae, Aug 24 '21 at 09:02
Perhaps you could also avoid this `left_join` by using `tidyr`'s `fill`-function. — Martin Gal, Aug 24 '21 at 09:17

Merging rows based on unique emails with overlapping data

1 Answers1