0

I have a dataset, in which hashtags columns contains variable number of hashtags separated by ;. For example:

id  hashtags
1   #apple;#peach           
2   #apple          
3   #pear;#orange;#banana

What I need to do is separate the column and convert it to tidy data. So each row contains id and one single hashtag.

id  hashtag
1   #apple          
1   #peach          
2   #apple          
3   #pear           
3   #orange         
3   #banana

I can do that with loop like below.

library(tidyverse)
df = data.frame(id=1:3, hashtags=c("#apple;#peach", "#apple", "#pear;#orange;#banana"))

dat = data.frame()
for (i in 1:nrow(df)) {
  dt = data.frame(id = df$id[i], 
                  hashtag = str_split(df$hashtags[i], ';')[[1]])
  dat = bind_rows(dat, dt)
}

But I think the way above is not a good solution. I found it's extremely slow when I have a very large dataset (my actual data).

Does anyone have a better and much quicker way to achieve that? Note that the number of hashtags for each id varies a lot. tidyr::separate does not seems to apply here.

M--
  • 25,431
  • 8
  • 61
  • 93
zesla
  • 11,155
  • 16
  • 82
  • 147

1 Answers1

5

An option is separate_rows and specify the sep

library(tidyverse)
df %>% 
   separate_rows(hashtags, sep=";")
#. id hashtags
#1  1   #apple
#2  1   #peach
#3  2   #apple
#4  3    #pear
#5  3  #orange
#6  3  #banana

If we don't specify the sep, it will automatically select all the possible delimiters found in the data

akrun
  • 874,273
  • 37
  • 540
  • 662