
I am trying to learn tidytext. I can follow the examples on the tidytext website as long as I use the packaged data sets (janeaustenr, e.g.). However, most of my data are text files in a corpus. I can reproduce the tm-to-tidytext conversion example for sentiment analysis (ap_sentiments) from the tidytext website, but I am having trouble understanding how tidytext data are structured. For example, the Austen novels are identified by a "book" column in the janeaustenr package. What is the equivalent of that "book" column for my tm data? Here is the specific example for my data:

cname <- file.path(".", "greencomments", "all")

I can then use tidytext successfully after running the tm preprocessing:

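library(dplyr)
library(tidytext)

# tdm is the term-document matrix produced by the tm preprocessing
practice <- tidy(tdm)
practice
partysentiments <- practice %>%
  inner_join(get_sentiments("bing"), by = c(term = "word"))
partysentiments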

# A tibble: 170 x 4
term    document count sentiment
<chr>   <chr>    <dbl> <chr>    
1 benefit 1         1.00 positive 
2 best    1         2.00 positive 
3 better  1         7.00 positive 
4 cheaper 1         1.00 positive 
5 clean   1        24.0  positive 
7 clear   1         1.00 positive 
8 concern 1         2.00 negative 
9 cure    1         1.00 positive 
10 destroy 1         3.00 negative 

However, I can't reproduce the simple ggplot2 word-frequency plots from the tidytext examples. Because my corpus has no "book" column in the data frame, that code (and therefore much of the tidytext functionality) won't work.

Here is an example of the issue. This works fine:

practice %>%
  count(term, sort = TRUE)

# A tibble: 989 x 2
term        n
<chr>   <int>
1 activ       3
2 air         3
3 altern      3

But how do I arrange the tm corpus to match the structure of the books in the janeaustenr package? Is "document" the equivalent of "book"? My corpus consists of text files in folders. I have tried substituting "document" for "book" in the code, and it doesn't work. Maybe I need to rename the column? Apologies in advance - I am not a programmer.

dcoffey
  • Can you clarify what is in "greencomments"? I don't think you need to use tm at all; you can use another, simpler approach to get your data into memory. – Julia Silge Nov 16 '18 at 21:53
  • the "green comments" folder contains open-ended survey responses; the "all" subfolder has, well, all survey responses (the other subfolders have text files in which responses are for subgroups in the survey sample). Thanks for responding. – dcoffey Nov 17 '18 at 04:04
  • What code are you using to load the documents? Because using VCorpus(DirSource(.my_directory.)) should work to bring in the document names. Then using `tidy` on the corpus should give you a tidy data.frame with an id column containing the document names and a text column with the text. And some other columns as well. After that you can just use tidytext as in the examples in the book. – phiver Nov 17 '18 at 11:39
  • If you are dealing with text files, I would recommend using read_lines() to read them in, instead of loading the tm package and converting back and forth. It will be much faster and simpler. You can use this approach, but read_lines() instead of read_csv(): https://stackoverflow.com/a/40943207/5468471 – Julia Silge Nov 17 '18 at 17:31
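
The read_lines() suggestion in the comments above can be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not the commenters' own code: the temporary directory, the file names response1.txt and response2.txt, and their contents are hypothetical stand-ins for the "greencomments/all" folder, and the object names comments, tidy_comments and word_counts are made up for illustration.

```r
library(dplyr)
library(readr)
library(tidyr)
library(tidytext)

# Hypothetical stand-in for the "greencomments/all" folder: two small text
# files written to a temporary directory so the sketch runs on its own.
comment_dir <- file.path(tempdir(), "greencomments", "all")
dir.create(comment_dir, recursive = TRUE, showWarnings = FALSE)
write_lines(c("Clean air is better for everyone.", "Solar is getting cheaper."),
            file.path(comment_dir, "response1.txt"))
write_lines("I have a concern that this will destroy jobs.",
            file.path(comment_dir, "response2.txt"))

# Read every .txt file: one row per line of text, with the file name kept
# in a `document` column.
files <- list.files(comment_dir, pattern = "\\.txt$", full.names = TRUE)
comments <- tibble(document = basename(files),
                   text = lapply(files, read_lines)) %>%
  unnest(text)

# One row per word. Here `document` plays the role that `book` plays in
# the janeaustenr examples.
tidy_comments <- comments %>%
  unnest_tokens(word, text)

# Word counts per document, analogous to count(book, word) in the examples.
word_counts <- tidy_comments %>%
  count(document, word, sort = TRUE)
word_counts
```

With the data in this shape, code from the tidytext examples that groups or counts by book should work once book is replaced by document. To rename the column instead, rename(book = document) from dplyr would do it.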

0 Answers