3

I have the following dataframe

group = c("cat", "dog", "horse")
value = c("1", "2", "3")
list = c("siamese,burmese,balinese","corgi,sheltie,collie","arabian,friesian,andalusian" )
df = data.frame(group, value, list)

df
  group value                        list
1   cat     1    siamese,burmese,balinese
2   dog     2        corgi,sheltie,collie
3 horse     3 arabian,friesian,andalusian

and am trying to achieve this:

  group value       list
1   cat     1    siamese
2   cat     1    burmese
3   cat     1   balinese
4   dog     2      corgi
5   dog     2    sheltie
6   dog     2     collie
7 horse     3    arabian
8 horse     3   friesian
9 horse     3 andalusian

I know how to summarize a dataframe, but I now realize that I don't know how to "unsummarize" one with comma separated strings.

Helen K.
  • 69
  • 7
  • 1
    Welcome to SO! Great first quetion! ! Try: `tidyr::separate_rows(df, list, sep=",")` – hrbrmstr Nov 25 '18 at 23:57
  • I *always* forget about `separate_rows` and start reaching for unnest and strsplit. Thanks for the reminder. – neilfws Nov 25 '18 at 23:59
  • @neilfws I regularly have to reference the help on both `gather` and `separate` as — for some reason — I can't keep the parameters in active memory and I tend to start from the pkg help index page so I see `separate_rows` _alot_ :-) Perhaps `disunite` (or something like that) might be a better name or alias for it. – hrbrmstr Nov 26 '18 at 00:03

2 Answers2

2
data.frame(
  group = c("cat", "dog", "horse"),
  value = c("1", "2", "3"),
  list = c("siamese,burmese,balinese","corgi,sheltie,collie","arabian,friesian,andalusian"),
  stringsAsFactors = FALSE
) -> xdf

tidyverse:

tidyr::separate_rows(xdf, list, sep=",")
##   group value       list
## 1   cat     1    siamese
## 2   cat     1    burmese
## 3   cat     1   balinese
## 4   dog     2      corgi
## 5   dog     2    sheltie
## 6   dog     2     collie
## 7 horse     3    arabian
## 8 horse     3   friesian
## 9 horse     3 andalusian

Base R:

do.call(
  rbind.data.frame,
  lapply(1:nrow(xdf), function(idx) {

    data.frame(
      group = xdf[idx, "group"],
      value = xdf[idx, "value"],
      list = strsplit(xdf[idx, "list"], ",")[[1]],
      stringsAsFactors = FALSE
    )

  })
)
##   group value       list
## 1   cat     1    siamese
## 2   cat     1    burmese
## 3   cat     1   balinese
## 4   dog     2      corgi
## 5   dog     2    sheltie
## 6   dog     2     collie
## 7 horse     3    arabian
## 8 horse     3   friesian
## 9 horse     3 andalusian

The shootout:

microbenchmark::microbenchmark(

  unnest = transform(xdf, list = strsplit(list, ",")) %>%
    tidyr::unnest(list),

  separate_rows = tidyr::separate_rows(xdf, list, sep=","),

  base = do.call(
    rbind.data.frame,
    lapply(1:nrow(xdf), function(idx) {

      data.frame(
        group = xdf[idx, "group"],
        value = xdf[idx, "value"],
        list = strsplit(xdf[idx, "list"], ",")[[1]],
        stringsAsFactors = FALSE
      )

    })
  )
)
## Unit: microseconds
##           expr      min        lq     mean   median        uq       max neval
##         unnest 3689.890 4280.7045 6326.231 4881.160  6428.508 16670.715   100
##  separate_rows 5093.618 5602.2510 8479.712 6289.193 10352.847 24447.528   100
##           base  872.343  975.1615 1589.915 1099.391  1660.324  6663.132   100

I'm constantly surprised at the horrible performance of tidyr ops.

hrbrmstr
  • 77,368
  • 11
  • 139
  • 205
0

IIUC we have unnest in R

library(dplyr)
library(tidyr)

df = data.frame(group, value, list,stringsAsFactors = F)
df %>%
   transform(list = strsplit(list, ",")) %>%
   unnest(list)
  group value       list
1   cat     1    siamese
2   cat     1    burmese
3   cat     1   balinese
4   dog     2      corgi
5   dog     2    sheltie
6   dog     2     collie
7 horse     3    arabian
8 horse     3   friesian
9 horse     3 andalusian
BENY
  • 317,841
  • 20
  • 164
  • 234