
With the following data frame:

 Query
1 word1 word2 word3
2 word4 word5
3 word6
4 word7 word8

df1 <- read.table(header = TRUE, text = "Query
1 'word1 word2 word3'
2 'word4 word5'
3 word6
4 'word7 word8'")

I want to create a new data frame with the following format:

 Query
1 word1
2 word2
3 word3
4 word4
5 word5
6 word6
7 word7
8 word8

Do you know of an R function or an R package able to do this? I am currently using the word function from the stringr package, but it seems to only be splitting out the first word of every observation. To get all the words, I would have to calculate each row's bounds (1 to the total number of words, i.e. whitespaces + 1) and call the function inside a for loop over every row, which is not optimal as my data frame has more than 5M observations.
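
Roughly, the row-by-row loop I am trying to avoid would look something like the sketch below (n_words, out and df2 are just illustrative names):

library(stringr)

# number of words in each row = number of whitespaces + 1
n_words <- str_count(as.character(df1$Query), " ") + 1

# extract every word of every row, one row at a time (slow on 5M rows)
out <- character(0)
for (i in seq_len(nrow(df1))) {
  out <- c(out, word(as.character(df1$Query[i]), seq_len(n_words[i])))
}
df2 <- data.frame(Query = out)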

rawr
agustin

3 Answers


You can try

library(splitstackshape)
cSplit(df1, 'Query', ' ', 'long')
#    Query
#1: word1
#2: word2
#3: word3
#4: word4
#5: word5
#6: word6
#7: word7
#8: word8
akrun
  • Thanks, it works like silk. It takes some time since the data frame is large, but I am also using an old laptop; at work, with modern desktops, it should run really fast. – agustin Apr 08 '15 at 11:49

I'm biased in favor of cSplit (and it should be reasonably fast even on 5M observations). However (converting my comment to an answer), you can also consider just using strsplit from base R, which is quite fast, especially if your delimiter is fixed:

data.frame(Query = unlist(strsplit(as.character(df1$Query), " ", TRUE)))
#   Query
# 1 word1
# 2 word2
# 3 word3
# 4 word4
# 5 word5
# 6 word6
# 7 word7
# 8 word8

Another approach comes from the Hadleyverse: use strsplit and then unnest from "tidyr".

library(dplyr)
library(tidyr)

df1 %>%
  mutate(Query = strsplit(as.character(Query), " ", TRUE)) %>%
  unnest(Query)
A5C1D2H2I1M1N2O1R2T1
  • I would have also used `strsplit`, but I like the `cSplit` for its compactness. Great function! – akrun Apr 08 '15 at 04:48
  • @akrun, thanks. It's in dire need of some cleaning up, but I can't seem to get the time these days to work on my packages. PRs are welcome :-) – A5C1D2H2I1M1N2O1R2T1 Apr 08 '15 at 04:59
  • Hi @akrun and Ananda. Thanks for your answers. Both approaches also worked very well, and as I am not a performance freak, I didn't time every approach to see which one was fastest. The script that updates my data actually takes quite a while (and I always go get myself a coffee meanwhile). However, I chose akrun's approach because I already use splitstackshape for the stratified function and, as you say, it looks so cute and compact... Being the daddy of splitstackshape, I figure you understand my choice :) – agustin Apr 08 '15 at 12:00

This is a memory-efficient way of solving the problem, in that I initialize a new data.table and modify it by reference. This approach uses the data.table package (plus stringr for str_split).

library(data.table)
library(stringr)  # needed for str_split below
# define data.table
Query<- data.table(x = c("word1 word2 word3", "word4 word5", "word6", "word7 word8"))

# create list containing each word as a character vector in each row
Query[, x := lapply(x, str_split, pattern = " ")]

# determine number of words in each list/row
Query[, length := vapply(x, function(x) length(x[[1]]), numeric(1))]

# initialize data.table of length equal to number of rows needed
new_Query <- data.table(x = rep(NA_character_, sum(Query$length, na.rm = TRUE)))

# loop through Query to assign the appropriate values to new_Query
j <- 1
for(i in 1:nrow(Query)){
  index2 <- j + Query[i, length] - 1
  new_Query[j:index2, x := Query[i, unlist(x)]]
  j <- index2 + 1
}
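
As a quick check on the example data, new_Query should now contain the eight single-word rows asked for above (the exact print format may vary slightly by data.table version):

new_Query
#        x
# 1: word1
# 2: word2
# 3: word3
# 4: word4
# 5: word5
# 6: word6
# 7: word7
# 8: word8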
k13