My data set looks like this:

tweet_created_at                              hashtag_text
2015-05-08 05:45:30                           farinaz,farkhunda,ozgecanaslan
2015-05-08 06:01:24                           ozgecanaslan,sendeanlat
2015-05-08 09:51:35                           ozgecanaslan,genclikyasaklanamaz

I need to convert my data set to this:

tweet_created_at                              hashtag_text
2015-05-08 05:45:30                           farinaz
2015-05-08 05:45:30                           farkhunda
2015-05-08 05:45:30                           ozgecanaslan
2015-05-08 06:01:24                           ozgecanaslan
2015-05-08 06:01:24                           sendeanlat
2015-05-08 09:51:35                           ozgecanaslan
2015-05-08 09:51:35                           genclikyasaklanamaz

I assume I can use something like sapply for this, but I couldn't figure out how to repeat the tweet_created_at value for each hashtag.

David Arenburg
eabanoz

2 Answers

You could try cSplit from library(splitstackshape). We specify splitCols as 'hashtag_text', sep as ',' and direction as 'long' to split the column and reshape the dataset to 'long' format.

 library(splitstackshape)
 cSplit(df1, splitCols = 'hashtag_text', sep = ',', direction = 'long')
 #      tweet_created_at        hashtag_text
 #1: 2015-05-08 05:45:30             farinaz
 #2: 2015-05-08 05:45:30           farkhunda
 #3: 2015-05-08 05:45:30        ozgecanaslan
 #4: 2015-05-08 06:01:24        ozgecanaslan
 #5: 2015-05-08 06:01:24          sendeanlat
 #6: 2015-05-08 09:51:35        ozgecanaslan
 #7: 2015-05-08 09:51:35 genclikyasaklanamaz

data

 df1 <- structure(list(tweet_created_at = c("2015-05-08 05:45:30", 
 "2015-05-08 06:01:24", 
 "2015-05-08 09:51:35"), hashtag_text =   
 c("farinaz,farkhunda,ozgecanaslan", 
 "ozgecanaslan,sendeanlat", "ozgecanaslan,genclikyasaklanamaz"
 )), .Names = c("tweet_created_at", "hashtag_text"),
 class = "data.frame", row.names = c(NA, -3L))
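
For comparison, the same reshaping can be sketched in base R without extra packages. This is only a minimal sketch, assuming hashtag_text is a character column (not a factor) and that every row contains at least one hashtag; the names tags and long are made up for illustration.

```r
# Same sample data as df1 above, rebuilt here so the snippet runs on its own
df1 <- data.frame(
  tweet_created_at = c("2015-05-08 05:45:30", "2015-05-08 06:01:24",
                       "2015-05-08 09:51:35"),
  hashtag_text = c("farinaz,farkhunda,ozgecanaslan",
                   "ozgecanaslan,sendeanlat",
                   "ozgecanaslan,genclikyasaklanamaz"),
  stringsAsFactors = FALSE
)

# Split each comma-separated string; the result is a list with one
# character vector per original row
tags <- strsplit(df1$hashtag_text, split = ",", fixed = TRUE)

# Repeat each timestamp once per hashtag in its row, then flatten the list
long <- data.frame(
  tweet_created_at = rep(df1$tweet_created_at, times = lengths(tags)),
  hashtag_text = unlist(tags),
  stringsAsFactors = FALSE
)
long
```

The rep(..., times = lengths(tags)) call is what repeats the tweet_created_at value, which is the part the question was stuck on. Note that lengths() needs R 3.2.0 or later.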
akrun
  • Hi akrun, it gives the same number of rows as the original data set. My original data set has 2274 obs., and after running this script the result is the same. – eabanoz Aug 03 '15 at 20:07
  • @eabanoz I was just copy/pasting the example you showed to get the expected output. – akrun Aug 03 '15 at 20:09
  • @eabanoz I tried your dput example. It gives the long format. Earlier I think the argument order was the problem. Can you try it now? – akrun Aug 03 '15 at 20:23
  • Hi akrun, it works and thanks – eabanoz Aug 03 '15 at 20:57
  • @eabanoz Thanks for commenting. I also had the same problem after I changed the order. Looks like the order of the arguments is important. – akrun Aug 03 '15 at 20:58

Using data.table:

library(data.table)
setDT(Womens.Rights)[,c(hashtag_text=strsplit(hashtag_text,split=",")),
                     by=tweet_created_at]
      tweet_created_at        hashtag_text
1: 2015-05-08_05:45:30             farinaz
2: 2015-05-08_05:45:30           farkhunda
3: 2015-05-08_05:45:30        ozgecanaslan
4: 2015-05-08_06:01:24        ozgecanaslan
5: 2015-05-08_06:01:24          sendeanlat
6: 2015-05-08_09:51:35        ozgecanaslan
7: 2015-05-08_09:51:35 genclikyasaklanamaz

(Note: I added underscores to the times manually to let read.table read in your data)
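
If the same timestamp can appear on more than one row, a group passed to strsplit holds several strings and j may return a different number of list elements per group. A possible variant, sketched here under that assumption, flattens the result with unlist() so that j always yields a single column. The table dt and the result res are hypothetical names with made-up sample rows, not the asker's data.

```r
library(data.table)

# Hypothetical sample in which rows 1 and 3 share a timestamp
dt <- data.table(
  tweet_created_at = c("2015-05-08 05:45:30", "2015-05-08 06:01:24",
                       "2015-05-08 05:45:30"),
  hashtag_text = c("farinaz,farkhunda", "ozgecanaslan,sendeanlat",
                   "ozgecanaslan")
)

# unlist() collapses the per-group list from strsplit() into one character
# vector, so every group returns exactly one column named hashtag_text
res <- dt[, .(hashtag_text = unlist(strsplit(hashtag_text, split = ",",
                                             fixed = TRUE))),
          by = tweet_created_at]
res
```

Groups come back in order of first appearance, so the hashtags from rows 1 and 3 end up next to each other under the shared timestamp.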

MichaelChirico
  • Hi Michael, thank you for your help. This script gives the following error message: Error in `[.data.table`(setDT(Womens.Rights), , strsplit(hashtag_text, : j doesn't evaluate to the same number of columns for each group. (Womens.Rights is the name of my data set) – eabanoz Aug 03 '15 at 19:54
  • @eabanoz you've got `by=tweet_created_at`, right? what version of `data.table` do you have installed... maybe this is some new behavior. – MichaelChirico Aug 03 '15 at 19:58
  • does this work? `setDT(Womens.Rights)[,strsplit(hashtag_text,split=","),by=tweet_created_at]` – MichaelChirico Aug 03 '15 at 20:00
  • I used it in the same format by=tweet_created_at and my data.table version is 1.9.4 and it is working under R version 3.2.1. – eabanoz Aug 03 '15 at 20:01
  • Maybe the class of `hashtag_text` is wrong. Can you put the output of `dput(head(Womens.Rights))` into your question? – MichaelChirico Aug 03 '15 at 20:02
  • structure(list(tweet_created_at = structure(c(1431070647, 1431068077, 1431070163, 1431078330, 1431079284, 1431082640), class = c("POSIXct", "POSIXt"), tzone = ""), hashtag_text = c("ferinazxosrewani,ozgecanaslan", "ozgecanaslan,sendeanlat", "ozgecanaslan", "farinaz,farkhunda,ozgecanaslan", "ozgecanaslan,sendeanlat", "ozgecanaslan")), .Names = c("tweet_created_at", "hashtag_text"), class = c("data.table", "data.frame"), row.names = c(NA, -6L), .internal.selfref = ) – eabanoz Aug 03 '15 at 20:03
  • So does it work or not? – David Arenburg Aug 03 '15 at 20:05
  • Strange. My code still works exactly on your `structure`. This doesn't work: `setDT(Womens.Rights)[,strsplit(hashtag_text,split=","),by=tweet_created_at]`? – MichaelChirico Aug 03 '15 at 20:06
  • Michael, it works and thanks. – eabanoz Aug 03 '15 at 20:56