splitting text in column and add row number

Question

I would like to split some text in a data frame column and save it into a data frame together with the row number or an id column.

I normally use plyr for this, but the same approach no longer works in dplyr.

If I understand correctly, my plyr code only works because of a bug in plyr.

So I am looking for the correct way to do this.

This is a minimal example in plyr:

library(plyr)
set.seed(1)
df <- data.frame(a=seq(2), 
                 b=c(paste(sample(letters,3), collapse=';'),
                     paste(sample(letters,3), collapse=';')),               
                 stringsAsFactors=FALSE)
ddply(df,.(a),summarise,unlist(strsplit(b,';')))

It turns the original data frame:

  a     b
1 1 g;j;n
2 2 x;f;v

Into a long data frame with one row per split value, each paired with its value of a.

What would be the correct dplyr solution?

Can you show the expected output? or are you trying to replicate the results you got from `plyr` using `dplyr`? — akrun, Mar 09 '15 at 08:05
I am happy with the result from plyr... I am just looking for the "correct" way of doing it, since a summarise call should return one row per group rather than the unlisted vector I produce here... And I would like to use dplyr exclusively in future — drmariod, Mar 09 '15 at 09:19

akrun · Answer 1 · 2015-03-09T08:11:13.530

4

You could do this using cSplit from splitstackshape:

library(splitstackshape)
cSplit(df, 'b', ';', 'long')
#   a b
#1: 1 g
#2: 1 j
#3: 1 n
#4: 2 x
#5: 2 f
#6: 2 v

Or using dplyr/tidyr:

library(dplyr)
library(tidyr)
# separate() needs one output name per piece, so this assumes exactly three values per row
separate(df, b, c('b1', 'b2', 'b3'), sep=";") %>%
   gather(Var, b, -a) %>%
   select(-Var) %>%
   arrange(a)

Or another option would be to use do:

df %>%
   group_by(a) %>% 
   do(data.frame(b=unlist(strsplit(.$b, ';'))))

edited Mar 09 '15 at 08:11

answered Mar 09 '15 at 08:00

akrun

I love the do solution! But I get a warning message... "Warning message: In rbind_all(out[[1]]) : Unequal factor levels: coercing to character" – drmariod Mar 09 '15 at 09:23
@user7601, add a `stringsAsFactors = FALSE` into the `data.frame` call in the `do` line. – A5C1D2H2I1M1N2O1R2T1 Mar 09 '15 at 09:27
(Also, `do` will scale to be the slowest option available.) – A5C1D2H2I1M1N2O1R2T1 Mar 09 '15 at 09:31
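Putting the `stringsAsFactors` suggestion from the comments into the `do` call might look like this. It is only a sketch, assuming a dplyr version in which `do()` is still available; `df` is written out literally to match the question's data rather than sampled:

```r
library(dplyr)

# Same data as in the question, typed in by hand
df <- data.frame(a = 1:2,
                 b = c("g;j;n", "x;f;v"),
                 stringsAsFactors = FALSE)

res <- df %>%
  group_by(a) %>%
  # One small data frame per group; keeping b as character avoids
  # combining factors with different levels across groups
  do(data.frame(b = unlist(strsplit(.$b, ";")),
                stringsAsFactors = FALSE))
res
```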

score 4 · Accepted Answer · answered Mar 09 '15 at 09:22

I'm biased in favor of cSplit from the "splitstackshape" package, but you might be interested in unnest from "tidyr" in conjunction with "dplyr":

library(dplyr)
library(tidyr)
df %>%
  mutate(b = strsplit(b, ";")) %>%
  unnest(b)
#   a b
# 1 1 g
# 2 1 j
# 3 1 n
# 4 2 x
# 5 2 f
# 6 2 v
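Later tidyr releases also provide `separate_rows()`, which does the split and the unnest in a single call. A minimal sketch, assuming a tidyr version that includes it (as far as I know it is not in the 0.2 release mentioned in this thread); `df` is again written out by hand to match the question's data:

```r
library(tidyr)

# Same data as in the question, typed in by hand
df <- data.frame(a = 1:2,
                 b = c("g;j;n", "x;f;v"),
                 stringsAsFactors = FALSE)

# Split b on ";" and give each piece its own row; a is repeated as needed
long <- separate_rows(df, b, sep = ";")
long
```

Unlike the `separate` approach, this does not need to know in advance how many values each row holds.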

A5C1D2H2I1M1N2O1R2T1

Hah, after thinking about it, I like the unnest function together with dplyr... Thanks for the solution. – drmariod Mar 09 '15 at 09:44
Ananda, suppose column b were a factor rather than character: how would you get your example to work? I thought I could change the strsplit call to strsplit(as.character(b), ";"), but this does not work... Am I missing something? – drmariod Mar 09 '15 at 09:51
@user7601, `strsplit(as.character(b), ";")` works for me. What error are you getting? – A5C1D2H2I1M1N2O1R2T1 Mar 09 '15 at 09:54
Hm, it was a naming problem... I expected to be able to change the name of the column, so I tried mutate(new_name = strsplit(b, ";")) followed by unnest(new_name)... But this didn't work... If I use b for all names, it works fine. – drmariod Mar 09 '15 at 10:01
@user7601, check your package version? Even that is working for me. – A5C1D2H2I1M1N2O1R2T1 Mar 09 '15 at 10:06
Just installed dplyr 0.4.1 and get the same problems: `df %>% transform(c = strsplit(as.character(b), ";")) %>% unnest(c)` gives me the following error: `Error in data.frame(list(a = 1:2, b = 1:2), c = list(c("g", "j", "n"), : arguments imply differing number of rows: 2, 3` (tidyr 0.2) – drmariod Mar 10 '15 at 06:34
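
One likely cause of the error in the last comment: the call uses base `transform` rather than dplyr's `mutate`. `transform` hands the list from `strsplit` to `data.frame()`, which tries to spread it into columns of length 3, whereas `mutate` can keep it as a list column. A sketch of the renamed-column, factor-input case under that assumption; `new_name` is just the illustrative name from the comments, and `df` is written out by hand:

```r
library(dplyr)
library(tidyr)

# Question's data, but with b stored as a factor
df <- data.frame(a = 1:2,
                 b = factor(c("g;j;n", "x;f;v")))

res <- df %>%
  # as.character() is needed because strsplit() rejects factors;
  # mutate() stores the resulting list as a list column
  mutate(new_name = strsplit(as.character(b), ";")) %>%
  unnest(new_name)
res
```

The original `b` column is kept alongside `new_name`; drop it with `select(-b)` if it is not needed.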
