
This is most likely a stupid question, but I've googled and googled and can't find a solution. I think it's because I don't know the right way to phrase my question when searching.

I have a data frame that I have converted to tidy text format in R to get rid of stop words. I would now like to 'untidy' that data frame back to its original format.

What's the opposite / inverse command of unnest_tokens?

Edit: here is what the data I'm working with look like. I'm trying to replicate analyses from Silge and Robinson's Tidy Text book but using Italian opera librettos.

character = c("FIGARO", "SUSANNA", "CONTE", "CHERUBINO") 
line = c("Cinque... dieci.... venti... trenta... trentasei...quarantatre", "Ora sì ch'io son contenta; sembra fatto inver per me. Guarda un po', mio caro Figaro, guarda adesso il mio cappello.", "Susanna, mi sembri agitata e confusa.", "Il Conte ieri perché trovommi sol con Barbarina, il congedo mi diede; e se la Contessina, la mia bella comare, grazia non m'intercede, io vado via, io non ti vedo più, Susanna mia!") 
sample_df = data.frame(character, line)
sample_df

character line
FIGARO    Cinque... dieci.... venti... trenta... trentasei...quarantatre
SUSANNA   Ora sì ch'io son contenta; sembra fatto inver per me. Guarda un po', mio caro Figaro, guarda adesso il mio cappello.
CONTE     Susanna, mi sembri agitata e confusa.
CHERUBINO Il Conte ieri perché trovommi sol con Barbarina, il congedo mi diede; e se la Contessina, la mia bella comare, grazia non m'intercede, io vado via, io non ti vedo più, Susanna mia!

I turn it into tidy text so I can get rid of stop words:

library(tidytext)
library(dplyr)

tidy_df <- sample_df %>%
           unnest_tokens(word, line)
# Get rid of stop words
# I had to make my own list of stop words for 18th century Italian opera;
# `mystopwords` is that character vector
itstopwords <- tibble(word = mystopwords)
tidy_df2 <- tidy_df %>%
            anti_join(itstopwords, by = "word")

Now I have something like this:

character word
FIGARO    cinque
FIGARO    dieci
FIGARO    venti
FIGARO    trenta
...

I would like to get it back into the format of character name plus the associated line so I can look at other things. Basically, I want the text in the same format it was in before, but with the stop words removed.

Kate
  • Hi, please read [this](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and edit your question. Knowing more about what your data are like and what you did will make it possible for other users to help you. – shea Oct 13 '17 at 19:14

2 Answers


Not a stupid question! The answer depends a bit on exactly what you are trying to do, but here would be my typical approach if I wanted to get my text back to its original form after some processing in its tidied form, using the group_by() function from dplyr.

First, let's go from raw text to a tidied format.

library(tidyverse)
library(tidytext)

tidy_austen <- janeaustenr::austen_books() %>%
    group_by(book) %>%
    mutate(linenumber = row_number()) %>%
    ungroup() %>%
    unnest_tokens(word, text)

tidy_austen
#> # A tibble: 725,055 x 3
#>    book                linenumber word       
#>    <fct>                    <int> <chr>      
#>  1 Sense & Sensibility          1 sense      
#>  2 Sense & Sensibility          1 and        
#>  3 Sense & Sensibility          1 sensibility
#>  4 Sense & Sensibility          3 by         
#>  5 Sense & Sensibility          3 jane       
#>  6 Sense & Sensibility          3 austen     
#>  7 Sense & Sensibility          5 1811       
#>  8 Sense & Sensibility         10 chapter    
#>  9 Sense & Sensibility         10 1          
#> 10 Sense & Sensibility         13 the        
#> # … with 725,045 more rows

The text is tidy now! But we can untidy it back to something like its original form. I typically approach this using group_by() and summarize() from dplyr, and str_c() from stringr. What does the text look like at the end, in this particular case?

tidy_austen %>% 
    group_by(book, linenumber) %>% 
    summarize(text = str_c(word, collapse = " ")) %>%
    ungroup()
#> # A tibble: 62,272 x 3
#>    book            linenumber text                                         
#>    <fct>                <int> <chr>                                        
#>  1 Sense & Sensib…          1 sense and sensibility                        
#>  2 Sense & Sensib…          3 by jane austen                               
#>  3 Sense & Sensib…          5 1811                                         
#>  4 Sense & Sensib…         10 chapter 1                                    
#>  5 Sense & Sensib…         13 the family of dashwood had long been settled…
#>  6 Sense & Sensib…         14 was large and their residence was at norland…
#>  7 Sense & Sensib…         15 their property where for many generations th…
#>  8 Sense & Sensib…         16 respectable a manner as to engage the genera…
#>  9 Sense & Sensib…         17 surrounding acquaintance the late owner of t…
#> 10 Sense & Sensib…         18 man who lived to a very advanced age and who…
#> # … with 62,262 more rows

Created on 2019-07-11 by the reprex package (v0.3.0)
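The same pattern applies directly to the opera data in the question. Here is a minimal sketch: `stop_it` is a made-up stop-word list standing in for the asker's `mystopwords` vector, which isn't shown in the question. Note that unnest_tokens() lowercases text and strips punctuation by default, so the rebuilt lines won't be byte-identical to the originals.

```r
library(dplyr)
library(tidytext)
library(stringr)

sample_df <- tibble(
  character = c("FIGARO", "CONTE"),
  line = c("Cinque... dieci.... venti",
           "Susanna, mi sembri agitata e confusa.")
)

# Hypothetical stop-word list, standing in for the asker's `mystopwords`
stop_it <- tibble(word = c("mi", "e"))

result <- sample_df %>%
  mutate(line_id = row_number()) %>%   # keep one id per original line
  unnest_tokens(word, line) %>%        # one row per word
  anti_join(stop_it, by = "word") %>%  # drop stop words
  group_by(character, line_id) %>%     # collapse back to one row per line
  summarize(line = str_c(word, collapse = " "), .groups = "drop")

result
```

The `line_id` column matters when the same character speaks more than once; without it, all of that character's words would collapse into a single row.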

Julia Silge
library(tidyverse)

# `tidy_austen` is the tidied data frame from the answer above
tidy_austen %>% 
     group_by(book, linenumber) %>% 
     summarise(text = str_c(word, collapse = " "))
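One caveat with either answer: the round trip is lossy, because unnest_tokens() lowercases text and strips punctuation by default. A tiny self-contained sketch with made-up data shows what actually comes back:

```r
library(dplyr)
library(tidytext)
library(stringr)

demo <- tibble(id = 1, text = "Ora sì, son contenta!")

rebuilt <- demo %>%
  unnest_tokens(word, text) %>%   # tokens come back lowercased, no punctuation
  group_by(id) %>%
  summarise(text = str_c(word, collapse = " "), .groups = "drop")

rebuilt$text
```

If the original punctuation and capitalisation matter for later analysis, keep the original `line` column (or a line id) alongside the tidy data rather than reconstructing the text from tokens.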
Rakesh Kumar
  • Can you explain how this answers the question? – Stephen Rauch Jun 21 '18 at 00:06
  • unnest_tokens() is a simple operation: it separates words and arranges them row-wise. The operation above is exactly the opposite: it collapses the words, separated by spaces, and groups them together based on a common key. – Rakesh Kumar Jun 22 '18 at 02:21
  • This really is the more obvious and cleaner solution. It also happens to be faster. The answer itself lacks completeness and detail, but IMO `group_by` and `summarize` is much more readable than the `nest` and `mutate` strategy. – Ista Oct 17 '18 at 13:10
  • I had just come back to this question to update my answer, because I found that `str_c()` worked for this within `summarize()`. Nice! – Julia Silge Jul 11 '19 at 13:12