I have a dataframe in R which has some rows as follows:

c("LouDobbs", "gen_jackkeane") || RT @LouDobbs: #AmericaFirst- @gen_jackkeane: The Taliban for 9 months have told their fighters to kill as many people as you can, to includ…

Above is an example row with two columns (shown here separated by ||): column 1 holds more than one username and column 2 holds the tweet text. I want each such row to be duplicated once per user (twice here), with a single user in column 1 of each copy. This should apply to every row in the data frame where more than one user is listed against the tweet text.

structure(list(user = list("Dandhy_Laksono", c("LouDobbs", "gen_jackkeane"
), "DeepStateExpose", "AndruewJamess", "jrossman12", "BiLLRaY2019", 
    "DeepStateExpose", "Dandhy_Laksono", "DeepStateExpose", "DeepStateExpose"), 
    full_text = c("RT @Dandhy_Laksono: Sebagian pendukung Jokowi ini mengalami bagaimana fitnah \"komunis dan PKI\" digunakan selama pemilu.\n\nSekarang mereka me…", 
    "RT @LouDobbs: #AmericaFirst- @gen_jackkeane: The Taliban for 9 months have told their fighters to kill as many people as you can, to includ…", 
    "RT @DeepStateExpose: The Only Reason The Deep State Cabal Has Stayed in Afghanistan For 18 Years Is To Protect Their Largest Poppy/Opium/Na…", 
    "RT @AndruewJamess: @BillOReilly @KamalaHarris is wrong. @realDonaldTrump has accomplished a lot. He set a record for  incoherent toilet twe…", 
    "RT @jrossman12: @SaraCarterDC Pakistan won't allow that as you already know. Your husband and the other U.S. troops have been forced to fig…", 
    "RT @BiLLRaY2019: JOKOWI TIDAK MEMBUNUH KPK..!\nMarkibong…\"Selamat tinggal Taliban di dalam KPK. Kalian kalah lagi, kalah lagi..!\"\n\n#JumatBer…", 
    "RT @DeepStateExpose: The Only Reason The Deep State Cabal Has Stayed in Afghanistan For 18 Years Is To Protect Their Largest Poppy/Opium/Na…", 
    "RT @Dandhy_Laksono: Sebagian pendukung Jokowi ini mengalami bagaimana fitnah \"komunis dan PKI\" digunakan selama pemilu.\n\nSekarang mereka me…", 
    "RT @DeepStateExpose: The Only Reason The Deep State Cabal Has Stayed in Afghanistan For 18 Years Is To Protect Their Largest Poppy/Opium/Na…", 
    "RT @DeepStateExpose: The Only Reason The Deep State Cabal Has Stayed in Afghanistan For 18 Years Is To Protect Their Largest Poppy/Opium/Na…"
    )), row.names = c(NA, 10L), class = "data.frame")
  • as a raw idea, we can create one more column to give the frequency of strings in column 1 and then act subsequently. – Ambrish Dhaka Sep 25 '19 at 04:14
  • How is column 1 stored? Is it a comma-separated string or a vector? If you already have date in R, can you share `dput(head(df))` to give a clear idea? – Ronak Shah Sep 25 '19 at 04:17
  • It does not need a date. It is just the users and their tweets. Some elements in column 1 have more than one user. I want to make them individual so I can get the frequency per user in the next stage. – Ambrish Dhaka Sep 25 '19 at 04:27
  • My final goal is a dataframe with columns as User, Tweet, Frequency. For each individual user and his tweets. – Ambrish Dhaka Sep 25 '19 at 04:29
  • So `tidyr::unnest(df, user)` would work? – Ronak Shah Sep 25 '19 at 04:32
  • I tried `t5b <- unnest(t5, user)`, but it is taking an unusually long time. `nrow` is 293469. It seems to be stuck, so I aborted the command. Is there any other method? – Ambrish Dhaka Sep 25 '19 at 04:43
  • There are a few approaches in the marked link https://stackoverflow.com/questions/26194298/unlist-data-frame-column-preserving-information-from-other-column. If none of them give satisfactory results, can you please update your post stating that you have tried the answers from the given link and they don't work for you? I'll reopen the question then. – Ronak Shah Sep 25 '19 at 04:46
  • Yes, I have gone through the `tidyr` and `data.table` approaches and they are not working for me. – Ambrish Dhaka Sep 25 '19 at 04:52
  • ok..reopened the question. – Ronak Shah Sep 25 '19 at 04:55

1 Answer

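We can use `lengths` to get the number of users in each element of the list column, repeat the other columns that many times with `rep`, and flatten the list column with `unlist`. It should be fast enough, as `lengths` is fast.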

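# number of users in each element of the list column
l1 <- lengths(df$user)
# one row per user; n is the number of users listed on the original tweet
out <- data.frame(user = unlist(df$user), n = rep(l1, l1),
          text = rep(df$full_text, l1))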
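
As a rough illustration of the same idea, here is a minimal sketch on a three-row toy data frame shaped like the one in the question, with one extra step for the per-user frequency mentioned in the comments. The toy tweets and the `freq` column name are assumptions made for illustration, and counting with `table` is just one possible way to get the frequency, not something taken from the code above.

```r
# Toy data mirroring the question's structure: `user` is a list column.
df <- data.frame(full_text = c("tweet A", "tweet B", "tweet C"),
                 stringsAsFactors = FALSE)
df$user <- list("Dandhy_Laksono", c("LouDobbs", "gen_jackkeane"), "Dandhy_Laksono")

# Number of users attached to each row.
l1 <- lengths(df$user)

# Flatten the list column and repeat each tweet once per user.
out <- data.frame(user = unlist(df$user),
                  text = rep(df$full_text, l1),
                  stringsAsFactors = FALSE)

# Per-user frequency: how many expanded rows each user appears in.
# `freq` is a hypothetical column name chosen for this sketch.
counts <- table(out$user)
out$freq <- as.integer(counts[out$user])

print(out)
```

Here the second toy row expands into two rows, one per user, and `freq` counts how often each user occurs across the expanded rows.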
akrun