Split a column by group

Question

My data set is like this:

tweet_created_at                              hashtag_text
2015-05-08 05:45:30                           farinaz,farkhunda,ozgecanaslan
2015-05-08 06:01:24                           ozgecanaslan,sendeanlat
2015-05-08 09:51:35                           ozgecanaslan,genclikyasaklanamaz

I need to convert my data set to this:

tweet_created_at                              hashtag_text
2015-05-08 05:45:30                           farinaz
2015-05-08 05:45:30                           farkhunda
2015-05-08 05:45:30                           ozgecanaslan
2015-05-08 06:01:24                           ozgecanaslan
2015-05-08 06:01:24                           sendeanlat
2015-05-08 09:51:35                           ozgecanaslan
2015-05-08 09:51:35                           genclikyasaklanamaz

I assume I could use something like sapply for this, but I couldn't figure out how to do it while repeating the tweet_created_at column.

Hi David, thank you for the notice; this is exactly the same question with the same solution. I tried to delete it, but I don't have permission to do so :( – eabanoz, Aug 03 '15 at 20:18

akrun · Accepted Answer · 2015-08-03T20:19:01.593

3

You could try cSplit from library(splitstackshape). We specify splitCols as 'hashtag_text', sep as ',' and direction as 'long' to split the column and reshape the dataset to 'long' format.

 library(splitstackshape)
 cSplit(df1, splitCols = 'hashtag_text', sep = ',', direction = 'long')
 #      tweet_created_at        hashtag_text
 #1: 2015-05-08 05:45:30             farinaz
 #2: 2015-05-08 05:45:30           farkhunda
 #3: 2015-05-08 05:45:30        ozgecanaslan
 #4: 2015-05-08 06:01:24        ozgecanaslan
 #5: 2015-05-08 06:01:24          sendeanlat
 #6: 2015-05-08 09:51:35        ozgecanaslan
 #7: 2015-05-08 09:51:35 genclikyasaklanamaz

data

 df1 <- structure(list(tweet_created_at = c("2015-05-08 05:45:30", 
 "2015-05-08 06:01:24", 
 "2015-05-08 09:51:35"), hashtag_text =   
 c("farinaz,farkhunda,ozgecanaslan", 
 "ozgecanaslan,sendeanlat", "ozgecanaslan,genclikyasaklanamaz"
 )), .Names = c("tweet_created_at", "hashtag_text"),
 class = "data.frame", row.names = c(NA, -3L))
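
For comparison, here is a minimal base R sketch of the same reshaping, close to the sapply idea from the question. It is not part of the cSplit approach above; the names `tweets`, `tags` and `long` are only illustrative, and `lengths()` assumes R 3.2.0 or later.

```r
# Illustrative copy of the example data (same values as df1 above)
tweets <- data.frame(
  tweet_created_at = c("2015-05-08 05:45:30",
                       "2015-05-08 06:01:24",
                       "2015-05-08 09:51:35"),
  hashtag_text = c("farinaz,farkhunda,ozgecanaslan",
                   "ozgecanaslan,sendeanlat",
                   "ozgecanaslan,genclikyasaklanamaz"),
  stringsAsFactors = FALSE
)

# Split each comma-separated string; the result is a list with one
# character vector per original row
tags <- strsplit(tweets$hashtag_text, ",", fixed = TRUE)

# Repeat each timestamp once per hashtag in its row, then flatten the tags
long <- data.frame(
  tweet_created_at = rep(tweets$tweet_created_at, lengths(tags)),
  hashtag_text = unlist(tags),
  stringsAsFactors = FALSE
)
long
```

The key step is `rep(..., lengths(tags))`, which handles the repeated tweet_created_at values without an explicit loop.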

edited Aug 03 '15 at 20:19

answered Aug 03 '15 at 20:02

akrun


Hi akrun, it gives the same number of rows as the original data set. My original data set has 2274 obs., and after running this script the result is the same. – eabanoz Aug 03 '15 at 20:07
@eabanoz I was just copy/pasting the example you showed to get the expected output. – akrun Aug 03 '15 at 20:09
@eabanoz I tried your dput example. It gives the long format. Earlier I think the argument order was the problem. Can you try it now? – akrun Aug 03 '15 at 20:23
Hi akrun, it works and thanks – eabanoz Aug 03 '15 at 20:57
@eabanoz Thanks for commenting. I also had the same problem after I changed the order. It looks like the order of the arguments is important. – akrun Aug 03 '15 at 20:58

MichaelChirico · Answer 2 · 2015-08-03T19:56:45.587

2

Using data.table:

library(data.table)
setDT(Womens.Rights)[, c(hashtag_text = strsplit(hashtag_text, split = ",")),
                     by = tweet_created_at]
#       tweet_created_at        hashtag_text
# 1: 2015-05-08_05:45:30             farinaz
# 2: 2015-05-08_05:45:30           farkhunda
# 3: 2015-05-08_05:45:30        ozgecanaslan
# 4: 2015-05-08_06:01:24        ozgecanaslan
# 5: 2015-05-08_06:01:24          sendeanlat
# 6: 2015-05-08_09:51:35        ozgecanaslan
# 7: 2015-05-08_09:51:35 genclikyasaklanamaz

(Note: I added underscores to the times manually to let read.table read in your data)
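
A hedged variant, in case some timestamps occur in more than one row: in that situation `strsplit()` returns several list elements for one group, which is one possible (unconfirmed) cause of a "j doesn't evaluate to the same number of columns for each group" error. Wrapping the call in `unlist()` keeps j at a single column per group. The table `dt` below is hypothetical sample data, with a duplicated timestamp added on purpose.

```r
library(data.table)

# Hypothetical sample: the first two rows share a timestamp on purpose
dt <- data.table(
  tweet_created_at = c("2015-05-08 05:45:30",
                       "2015-05-08 05:45:30",
                       "2015-05-08 06:01:24"),
  hashtag_text = c("farinaz,farkhunda",
                   "ozgecanaslan",
                   "ozgecanaslan,sendeanlat")
)

# unlist() flattens the per-row pieces into one vector per group,
# so j always yields exactly one column
res <- dt[, .(hashtag_text = unlist(strsplit(hashtag_text, ",", fixed = TRUE))),
          by = tweet_created_at]
res
```

With unique timestamps this should give the same rows as the call above; the difference only matters when a group contains more than one row.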

edited Aug 03 '15 at 19:56

answered Aug 03 '15 at 19:31

MichaelChirico


Hi Michael, thank you for your help. This script gives the following error message: Error in `[.data.table`(setDT(Womens.Rights), , strsplit(hashtag_text, : j doesn't evaluate to the same number of columns for each group. (Womens.Rights is the name of my data set) – eabanoz Aug 03 '15 at 19:54
@eabanoz you've got `by=tweet_created_at`, right? what version of `data.table` do you have installed... maybe this is some new behavior. – MichaelChirico Aug 03 '15 at 19:58
does this work? `setDT(Womens.Rights)[,strsplit(hashtag_text,split=","),by=tweet_created_at]` – MichaelChirico Aug 03 '15 at 20:00
I used it in the same format by=tweet_created_at and my data.table version is 1.9.4 and it is working under R version 3.2.1. – eabanoz Aug 03 '15 at 20:01
Maybe the class of `hashtag_text` is wrong. Can you put the output of `dput(head(Womens.Rights))` into your question? – MichaelChirico Aug 03 '15 at 20:02
structure(list(tweet_created_at = structure(c(1431070647, 1431068077, 1431070163, 1431078330, 1431079284, 1431082640), class = c("POSIXct", "POSIXt"), tzone = ""), hashtag_text = c("ferinazxosrewani,ozgecanaslan", "ozgecanaslan,sendeanlat", "ozgecanaslan", "farinaz,farkhunda,ozgecanaslan", "ozgecanaslan,sendeanlat", "ozgecanaslan")), .Names = c("tweet_created_at", "hashtag_text"), class = c("data.table", "data.frame"), row.names = c(NA, -6L), .internal.selfref = ) – eabanoz Aug 03 '15 at 20:03
So does it work or not? – David Arenburg Aug 03 '15 at 20:05
Strange. My code still works exactly on your `structure`. This doesn't work: `setDT(Womens.Rights)[,strsplit(hashtag_text,split=","),by=tweet_created_at]`? – MichaelChirico Aug 03 '15 at 20:06
Michael, it works and thanks. – eabanoz Aug 03 '15 at 20:56
