
With the following data frame:

 Query
1 word1 word2 word3
2 word4 word5
3 word6
4 word7 word8

df1 <- read.table(header = TRUE, text = "Query
1 'word1 word2 word3'
2 'word4 word5'
3 word6
4 'word7 word8'")

I want to create a new data frame with the following format:

 Query
1 word1
2 word2
3 word3
4 word4
5 word5
6 word6
7 word7
8 word8

Do you know of an R function or an R package able to do this? I am currently using the word function from the stringr package, but it seems to only be splitting out the first word of every observation. To get all the words, I would have to calculate each row's bounds (1 to the total number of words, i.e. whitespaces + 1) and call the function inside a for loop over every row, which is not optimal as my data frame has more than 5M observations.
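
Roughly, the row-by-row loop I am trying to avoid would look something like the sketch below (n_words, out and df2 are just illustrative names):

library(stringr)

# number of words in each row = number of whitespaces + 1
n_words <- str_count(as.character(df1$Query), " ") + 1

# extract every word of every row, one row at a time (slow on 5M rows)
out <- character(0)
for (i in seq_len(nrow(df1))) {
  out <- c(out, word(as.character(df1$Query[i]), seq_len(n_words[i])))
}
df2 <- data.frame(Query = out)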

rawr
agustin

3 Answers


You can try

library(splitstackshape)
cSplit(df1, 'Query', ' ', 'long')
#    Query
#1: word1
#2: word2
#3: word3
#4: word4
#5: word5
#6: word6
#7: word7
#8: word8
akrun
  • Thanks, it works like silk. It takes some time since the data frame is large, but I am also using an old laptop; at work, with modern desktops, it should run really fast. – agustin Apr 08 '15 at 11:49

I'm biased in favor of cSplit (and it should be reasonably fast even on 5M observations). However (converting my comment to an answer), you can also consider just using strsplit from base R, which is quite fast, especially if your delimiter is fixed:

data.frame(Query = unlist(strsplit(as.character(df1$Query), " ", TRUE)))
#   Query
# 1 word1
# 2 word2
# 3 word3
# 4 word4
# 5 word5
# 6 word6
# 7 word7
# 8 word8

Another approach comes from the Hadleyverse: use strsplit and then unnest from "tidyr".

library(dplyr)
library(tidyr)

df1 %>%
  mutate(Query = strsplit(as.character(Query), " ", TRUE)) %>%
  unnest(Query)
A5C1D2H2I1M1N2O1R2T1
  • I would have also used `strsplit`, but I like the `cSplit` for its compactness. Great function! – akrun Apr 08 '15 at 04:48
  • @akrun, thanks. It's in dire need of some cleaning up, but I can't seem to get the time these days to work on my packages. PRs are welcome :-) – A5C1D2H2I1M1N2O1R2T1 Apr 08 '15 at 04:59
  • Hi @akrun and Ananda. Thanks for your answers. Both approaches also worked very well, and as I am not a performance freak, I didn't time every approach to see which one was fastest. The script that updates my data actually takes quite a while (and I always go get myself a coffee meanwhile). However, I chose akrun's approach because I already use splitstackshape for the stratified function and, as you say, it looks so cute and compact... Being the daddy of splitstackshape, I figure you understand my choice :) – agustin Apr 08 '15 at 12:00

This is a memory-efficient way of solving the problem, in that I initialize a new data.table and modify it by reference. This approach uses the data.table package (plus stringr for str_split).

library(data.table)
library(stringr)  # needed for str_split below
# define data.table
Query<- data.table(x = c("word1 word2 word3", "word4 word5", "word6", "word7 word8"))

# create list containing each word as a character vector in each row
Query[, x := lapply(x, str_split, pattern = " ")]

# determine number of words in each list/row
Query[, length := vapply(x, function(x) length(x[[1]]), numeric(1))]

# initialize data.table of length equal to number of rows needed
new_Query <- data.table(x = rep(NA_character_, sum(Query$length, na.rm = TRUE)))

# loop through Query to assign the appropriate values to new_Query
j <- 1
for(i in 1:nrow(Query)){
  index2 <- j + Query[i, length] - 1
  new_Query[j:index2, x := Query[i, unlist(x)]]
  j <- index2 + 1
}
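
As a quick check on the example data, new_Query should now contain the eight single-word rows asked for above (the exact print format may vary slightly by data.table version):

new_Query
#        x
# 1: word1
# 2: word2
# 3: word3
# 4: word4
# 5: word5
# 6: word6
# 7: word7
# 8: word8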
k13