1

I want to transforms a data frame with contact information with of a for a list of municipalities in which similar information such as e.g. phone number appears in multiple columns.

I have tried using both reshape2::dcast() as well as tidyr::spread(), neither of which solves my problem. I have also checked other post of stack overflow e.g.

Multiple column spread

Have yet to find a solution which works. It seems to me that the problems should be fairly straightforward (and solvable with spread or dcast).

tmp <- tibble(municipality = c("M1", "M2"), 
       name1 = c("n1", "n2"), name2 = c("n3", "n4"), name3 = c(NA, "n5"), # placeholder names
       phone1 = c("p1", "p2"), phone2 = c("p3", "p4"), phone3 = c(NA, "p5")) # placeholder phone numbers

#solution 1
tmp %>% gather("colname", "value", -municipality) %>% 
  filter(municipality == "M1") %>% #too simplify, should be replaced with group_by(municipality)
  na.omit() %>% mutate(colname = str_replace(colname, "\\d", replacement = "")) %>% 
  spread(., key = "colname", value = "value")

#Solution 2
tmp %>% gather("colname", "value", -municipality) %>% 
  filter(municipality == "M1") %>% # same as above
  na.omit() %>% mutate(colname = str_replace(colname, "\\d", replacement = "")) %>% 
  dcast(municipality + value ~colname)


Solution 1 results in the following error: Error: Each row of output must be identified by a unique combination of keys.

Solution 2 results in the following data frame (which is the desired result except it needs to be collapsed):

  municipality value name phone
1           M1    n1   n1  <NA>
2           M1    n3   n3  <NA>
3           M1    p1 <NA>    p1
4           M1    p3 <NA>    p3
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213

2 Answers2

1

Are you looking for?

library(dplyr)
library(tidyr)

tmp %>%
  gather(key, value, -municipality, na.rm = TRUE) %>%
  mutate(key = gsub("\\d+", "", key)) %>%
  group_by(municipality, key) %>%
  mutate(row = row_number()) %>%
  spread(key, value) %>%
  select(-row)

# municipality name  phone
# <chr>        <chr> <chr>
#1 M1           n1    p1   
#2 M1           n3    p3   
#3 M2           n2    p2   
#4 M2           n4    p4   
#5 M2           n5    p5  

We can use gather to bring the data in long format dropping NA values. Remove numbers from individual column names so that they share the same key, create a column group_by municipality and key to spread the data into wide format.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Exactly! Would you mind explaining why you need to group by key and mutate add the row number? – Kenneth Enevoldsen Aug 06 '19 at 12:02
  • @KennethEnevoldsen as there is no common identifier to `spread` columns we need that. If we do without that we get an error. `tmp %>% gather(key, value, -municipality, na.rm = TRUE) %>% mutate(key = gsub("\\d+", "", key)) %>% spread(key, value)` – Ronak Shah Aug 06 '19 at 12:44
1

We can do this elegantly with pivot_longer from the dev version of tidyr

library(dplyr)
library(tidyr)# 0.8.3.9000
library(stringr)
tmp %>%
   rename_at(-1,  ~str_replace(., "(\\d+$)", "_\\1")) %>%
   pivot_longer(cols = -municipality, names_to = c(".value", "group"), 
        names_sep="_", values_drop_na = TRUE) %>%
   select(-group)
# A tibble: 5 x 3
#  municipality name  phone
#  <chr>        <chr> <chr>
#1 M1           n1    p1   
#2 M1           n3    p3   
#3 M2           n2    p2   
#4 M2           n4    p4   
#5 M2           n5    p5   

Or another option is melt from data.table

library(data.table)
melt(setDT(tmp), measure = patterns("^name", "^phone"), 
   value.name = c("name", "phone"), na.rm = TRUE)[, variable := NULL][]
#.  municipality name phone
#1:           M1   n1    p1
#2:           M2   n2    p2
#3:           M1   n3    p3
#4:           M2   n4    p4
#5:           M2   n5    p5
akrun
  • 874,273
  • 37
  • 540
  • 662