0

I have a column in a data table which is a list of comma separated values

dt = data.table( a = c('a','b','c'), b = c('xx,yy,zz','mm,nn','qq,rr,ss,tt'))
> dt
   a           b
1: a    xx,yy,zz
2: b       mm,nn
3: c qq,rr,ss,tt

I would like to transform it into a long format

   a  b
1: a xx
2: a yy
3: a zz
4: b mm
5: b nn
6: c qq
7: c rr
8: c ss
9: c tt

This question has been answered for a data frame here. I'm wondering if there is an elegant data table solution.

Community
  • 1
  • 1
Kerry
  • 411
  • 4
  • 13
  • You can try `setnames(dt[, strsplit(b, ','), a], "V1", "b")[]` – akrun Jan 11 '17 at 17:04
  • You're too fast. I was getting to your comment. I would have initially accepted your comment as the answer and for being first. I think the accepted answer is cleaner. Appreciate your quick response! – Kerry Jan 11 '17 at 17:30

3 Answers3

4

The following will work for your example:

dt[, c(b=strsplit(b, ",")), by=a]
   a b
1: a xx
2: a yy
3: a zz
4: b mm
5: b nn
6: c qq
7: c rr
8: c ss
9: c tt

This method fails if the "by" variable is repeated as in

dt = data.table(a = c('a','b','c', 'a'),
                b = c('xx,yy,zz','mm,nn','qq,rr,ss,tt', 'zz,gg,tt'))

One robust solution in this situation can be had by using paste to collapse all observations with the same grouping variable (a) and feeding the result to the code above.

dt[, .(b=paste(b, collapse=",")), by=a][, c(b=strsplit(b, ",")), by=a]

This returns

    a  b
 1: a xx
 2: a yy
 3: a zz
 4: a zz
 5: a gg
 6: a tt
 7: b mm
 8: b nn
 9: c qq
10: c rr
11: c ss
12: c tt
lmo
  • 37,904
  • 9
  • 56
  • 69
  • I'm familiar with using 'by' for aggregation and reduction. Interesting to see it employed for expansion. Thanks! – Kerry Jan 11 '17 at 17:28
  • Thanks. I added a potentially more robust solution in the face of repeated IDs. – lmo Jan 11 '17 at 17:40
  • I don't find that the second half of this answer works as shown here. I am getting column `a` repeated a/b/c/a x3, not the pattern in above. – moman822 Sep 17 '19 at 14:01
  • 1
    @moman822 You are right. I have corrected this to provide an implementation that will work for such a case. – lmo Sep 17 '19 at 21:30
1

There is another method, but this method involves another package : splitstackshape.

library(splitstackshape)
cSplit(dt, "b", sep = ",", direction = "long")
   a  b
1: a xx
2: a yy
3: a zz
4: b mm
5: b nn
6: c qq
7: c rr
8: c ss
9: c tt

This function uses data.table to work. And this work even if we have multiple same value for the column "a".

Rilgar17
  • 181
  • 1
  • 7
-2

We can split the column 'b' by the delimiter ',' (using strsplit), grouped by 'a' and set the name of the new column i.e. 'V1' to 'b' with setnames

setnames(dt[, strsplit(b, ','), by = a], "V1", "b")[]
#   a  b
#1: a xx
#2: a yy
#3: a zz
#4: b mm
#5: b nn
#6: c qq
#7: c rr
#8: c ss
#9: c tt

If there are repeating elements in 'a' as in the below example

dt <- data.table(a = c('a','b','c', 'a'),
            b = c('xx,yy,zz','mm,nn','qq,rr,ss,tt', 'zz,gg,tt'))

we can group by the sequence of rows, do the strsplit on 'b', concatenate with the 'a' column and assign (:=) the 'grp' to NULL

dt[, c(a=a, b=strsplit(b, ",")), .(grp = 1:nrow(dt))][, grp := NULL][]
#    a  b
# 1: a xx
# 2: a yy
# 3: a zz
# 4: b mm
# 5: b nn
# 6: c qq
# 7: c rr
# 8: c ss
# 9: c tt
#10: a zz
#11: a gg
#12: a tt

NOTE: Both the methods are data.table methods

akrun
  • 874,273
  • 37
  • 540
  • 662