In R, I'm working with datasets of conversations. The data currently looks like this:

Mike, "Hello how are you"
Sally, "Good you"

I plan to eventually create a word cloud of this data and would need it to look like this:

Mike, Hello
Mike, how
Mike, are
Mike, you
Sally, good
Sally, you
Joseph K.
  • What did you try so far? – dww Dec 18 '17 at 22:41
  • I'm not entirely familiar with R, so I do not know much. Previously, when I only had the long strings and no name attached to them, I did: thing1 <- strsplit(df, " "); df1 <- data.frame(thing1 = unlist(thing1)) – Bradley Erickson Dec 18 '17 at 22:43
  • Your title is not really representative of what you're trying to do. "How to separate a sentence into words" or similar would be better. – alistaire Dec 18 '17 at 22:54

2 Answers

Perhaps something like this using reshape2::melt?

# Sample data
df <- read.csv(text =
    'Mike, "Hello how are you"
    Sally, "Good you"', header = FALSE)

# Split on words
lst <- strsplit(trimws(as.character(df[, 2])), "\\s")
names(lst) <- trimws(df[, 1])

# Reshape into long dataframe
library(reshape2)
df.long <- melt(lst)[2:1]
#     L1 value
#1  Mike Hello
#2  Mike   how
#3  Mike   are
#4  Mike   you
#5 Sally  Good
#6 Sally   you

Explanation: Trim leading and trailing whitespace (trimws) from the entries in the second column, split them on whitespace (\\s), and store the result in a list. Name the list entries after the first column, then reshape the list into a long data.frame using reshape2::melt.
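
For comparison, the same split-then-lengthen idea can be sketched in base R alone, without reshape2. This is only an illustration: it assumes the data already sits in a two-column data frame, and the column names speaker, sentence, and word are made up here rather than taken from the question.

```r
# Hypothetical input; the column names are assumptions for illustration
df <- data.frame(
  speaker  = c("Mike", "Sally"),
  sentence = c("Hello how are you", "Good you"),
  stringsAsFactors = FALSE
)

# Split each trimmed sentence on runs of whitespace (one list entry per row)
words <- strsplit(trimws(df$sentence), "\\s+")

# Repeat each speaker once per word, then flatten the list of words
df_long <- data.frame(
  speaker = rep(df$speaker, lengths(words)),
  word    = unlist(words),
  stringsAsFactors = FALSE
)
df_long
```

The rep(..., lengths(words)) call does the job that melt does above: it pairs every word with the name of the row it came from.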

I leave turning this into a comma-separated data.frame up to you...
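
One possible way to finish that last step is sketched below. It assumes a long data frame with the columns L1 and value, as melt produces above; the data frame is rebuilt by hand here so the snippet runs on its own, and the variable name out is just a placeholder.

```r
# Hand-built stand-in for the melted data frame (an assumption for this sketch)
df.long <- data.frame(
  L1    = c("Mike", "Mike", "Mike", "Mike", "Sally", "Sally"),
  value = c("Hello", "how", "are", "you", "Good", "you"),
  stringsAsFactors = FALSE
)

# Build one "speaker, word" string per row and print them
out <- paste(df.long$L1, df.long$value, sep = ", ")
writeLines(out)

# Alternatively, emit proper CSV (with a header row) to the console
write.csv(df.long, row.names = FALSE, quote = FALSE)
```

Passing a file name as the second argument of write.csv would write the same rows to disk instead of the console.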

Maurits Evers
  • Why the downvotes? This *does* provide a valid solution, and should get OP on track. – Maurits Evers Dec 18 '17 at 22:46
  • Perhaps it's because the OP didn't show enough evidence of attempt. Maybe it is a discouragement of supporting that type of ethics. – Joseph K. Dec 18 '17 at 22:48
  • @JosephK. Ok; I agree with the lack of own initiative from OP; that's why I decided to give a solution outline plus explanation, rather than a complete solution. – Maurits Evers Dec 18 '17 at 22:52
  • @JosephK. The question has an explicit input and a desired output, which is better than 95% of R questions and enough to not close it as no reprex. Some failed attempts would be nice, yes, but this is not even in the bottom half of questions, particularly for a first one. Especially for new users, it's important to teach how to ask a better question, not just downvote the questions to oblivion, demoralizing someone who doesn't know better yet. – alistaire Dec 18 '17 at 23:15

Use a tokenizer, e.g. via tidytext::unnest_tokens:

library(tidyverse)
library(tidytext)

dialogue <- read_csv(
    'Mike, "Hello how are you"
     Sally, "Good you"', 
    col_names = c('speaker', 'sentence')
)

dialogue %>% unnest_tokens(word, sentence)
#> # A tibble: 6 x 2
#>   speaker  word
#>     <chr> <chr>
#> 1    Mike hello
#> 2    Mike   how
#> 3    Mike   are
#> 4    Mike   you
#> 5   Sally  good
#> 6   Sally   you
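
Since the stated goal is a word cloud, the step after tokenizing is typically a frequency count. Here is a rough base R sketch, assuming a long table with speaker and word columns like the tibble above; it is rebuilt by hand so the snippet runs without tidytext, and the names tokens and freq are placeholders.

```r
# Hand-built stand-in for the tokenized output (an assumption for this sketch)
tokens <- data.frame(
  speaker = c("Mike", "Mike", "Mike", "Mike", "Sally", "Sally"),
  word    = c("hello", "how", "are", "you", "good", "you"),
  stringsAsFactors = FALSE
)

# Count how often each word occurs across all speakers
freq <- as.data.frame(table(word = tokens$word), stringsAsFactors = FALSE)

# Sort so the most frequent words come first
freq <- freq[order(-freq$Freq), ]
freq
```

A word/frequency table of this shape is what most word cloud functions expect as input.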
alistaire