How to use tidyr::separate when the number of needed variables is unknown

Question

I've got a dataset that consists of email communication. An example:

library(dplyr)
library(tidyr)

dat <- data_frame('date' = Sys.time(), 
                  'from' = c("person1@gmail.com", "person2@yahoo.com", 
                             "person3@hotmail.com", "person4@msn.com"), 
                  'to' = c("person2@yahoo.com,person3@hotmail.com", "person3@hotmail.com", 
                           "person4@msn.com,person1@gmail.com,person2@yahoo.com", "person1@gmail.com"))

In the above example it's simple enough to see how many variables I need, so I could just do the following:

dat %>% separate(to, into = paste0("to_", 1:3), sep = ",", extra = "merge", fill = "right")

#Source: local data frame [4 x 5]
#
#                 date                from                to_1                to_2              to_3
#               (time)               (chr)               (chr)               (chr)             (chr)
#1 2015-10-22 14:52:41   person1@gmail.com   person2@yahoo.com person3@hotmail.com                NA
#2 2015-10-22 14:52:41   person2@yahoo.com person3@hotmail.com                  NA                NA
#3 2015-10-22 14:52:41 person3@hotmail.com     person4@msn.com   person1@gmail.com person2@yahoo.com
#4 2015-10-22 14:52:41     person4@msn.com   person1@gmail.com                  NA                NA

However, my dataset is 4,000 records long and I'd rather not go through and find the row with the most number of elements in it so that I can determine how many variables I need to create. My approach to handling this is to first split the column myself and get the length of each split and then find the max:

n_vars <- dat$to %>% str_split(",") %>% lapply(function(z) length(z)) %>% unlist() %>% max()

But that seems inefficient. Is there a better way of doing this?

Maybe also `library(data.table) ; cbind(dat, setDT(dat)[, tstrsplit(to, ",")])` — David Arenburg, Oct 22 '15 at 19:25
Seems like a reasonable enough solution instead of loading another package. if you're already loading `tidyverse` could use `map_dbl` and remove `unlist` to make it slightly cleaner. `dat$to %>% str_split(",") %>% map_dbl(~ length(.)) %>% max()` — Tunn, Jul 21 '17 at 02:25
You could calculate n_vars more neatly and without splitting with `n_vars <- dat$to %>% str_count(pattern = ",") %>% max() + 1` — Allen Baron, Oct 14 '19 at 22:33

jeremycg · Answer 1 · 2018-05-15T09:38:55.780

26

This is a good question - my usual repsonse is to use strsplit, then unnest and spread, which is also not super efficient:

library(dplyr)
library(tidyr)

dat %>% mutate(to = strsplit(to, ",")) %>%
        unnest(to) %>%
        group_by(from) %>%
        mutate(row = row_number()) %>%
        spread(row, to)

Source: local data frame [4 x 5]

                 date                from                   1                   2                 3
               (time)               (chr)               (chr)               (chr)             (chr)
1 2015-10-22 15:03:17   person1@gmail.com   person2@yahoo.com person3@hotmail.com                NA
2 2015-10-22 15:03:17   person2@yahoo.com person3@hotmail.com                  NA                NA
3 2015-10-22 15:03:17 person3@hotmail.com     person4@msn.com   person1@gmail.com person2@yahoo.com
4 2015-10-22 15:03:17     person4@msn.com   person1@gmail.com                  NA                NA

edited May 15 '18 at 09:38

answered Oct 22 '15 at 19:11

jeremycg

24,657
5
63
74

I think this is better, because you can optionally not spread at the end. Dividing emails into all these separate columns seems stupid. – bramtayl Oct 22 '15 at 20:17
@bramtayl I have a feeling that separate deliberately makes this hard, so you don't end up with wide data rather than long – jeremycg Oct 22 '15 at 20:25
@bramtayl I agree that having wide data in this scenario doesn't make sense (and doesn't in most situations), but it was an easy concept to wrap your head around to get to the underlying issue of using `separate` when you don't know how many new columns you'll need. Akrun's solution is still a good one because you set `direction = "long"` to get all the emails in one column. – tblznbits Oct 22 '15 at 20:28
Note that if you don't want a wide dataset just stop command at `unnest` then you ended up with `long` data format – Sinh Nguyen Sep 09 '20 at 03:29

score 19 · Accepted Answer · answered Oct 22 '15 at 19:14

19

We could use cSplit

library(splitstackshape) 
cSplit(dat, 'to', ',')

answered Oct 22 '15 at 19:14

akrun

874,273
37
540
662

How to use tidyr::separate when the number of needed variables is unknown

2 Answers2

Linked

Related