String tokenization inside R data frame

Asked Sep 17 '14 at 21:08

Active Sep 17 '14 at 21:08

Viewed 537 times

I have a data frame in R that can be reproduced with this code:

id1 <- c("NP", "AK", "HT")
id2 <- c("t1", "t5", "t2")
Sentence <- c("This is an example .", "This too !", "Ok")
df <- data.frame(id1, id2, Sentence)

It looks like this:

  id1   id2   Sentence
1  NP    t1   This is an example .
2  AK    t5   This too !
3  HT    t2   Ok

I would like to restructure it into something like this, where each value in the Sentence column is split on spaces and every token gets its own row, with id1 and id2 repeated:

  id1   id2   Sentence
1  NP    t1   This
2  NP    t1   is
3  NP    t1   an
4  NP    t1   example
5  NP    t1   .
6  AK    t5   This
7  AK    t5   too
8  AK    t5   !
9  HT    t2   Ok

I know there is the function strsplit, and the tm package also seems to have a function called tokenizer, but I don't really understand how to do something like this inside a data frame.

Thank you!

nikopartanen

This is a duplicate of so many questions... For starters, try `library(data.table); setDT(df)[, unlist(strsplit(as.character(Sentence), " ")), by = list(id1, id2)]` while I gather some dups. – David Arenburg Sep 17 '14 at 21:15
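
For comparison, here is a minimal base R sketch of the same split-and-repeat idea, without data.table. It is only one possible approach, not the commenter's method. The names `tokens`, `row_idx` and `result` are illustrative, and `stringsAsFactors = FALSE` is an assumption added to the question's code so that Sentence stays a character vector.

```r
# Reproduce the data frame from the question.
# stringsAsFactors = FALSE (an addition) keeps the columns as character.
df <- data.frame(
  id1 = c("NP", "AK", "HT"),
  id2 = c("t1", "t5", "t2"),
  Sentence = c("This is an example .", "This too !", "Ok"),
  stringsAsFactors = FALSE
)

# Split every sentence on single spaces; this gives a list with one
# character vector of tokens per original row.
tokens <- strsplit(df$Sentence, " ", fixed = TRUE)

# Repeat each row index once per token in that row.
row_idx <- rep(seq_len(nrow(df)), lengths(tokens))

# Build the long data frame: ids repeated, tokens flattened.
result <- data.frame(
  id1 = df$id1[row_idx],
  id2 = df$id2[row_idx],
  Sentence = unlist(tokens),
  stringsAsFactors = FALSE
)

print(result)
```

Splitting on a fixed single space matches the example data, where punctuation is already separated by spaces. Sentences with repeated or leading spaces would produce empty tokens and would need a different split pattern.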
