I have a data frame that I have converted to tidy text format in R to get rid of stop words. I would now like to 'untidy' that data frame back to its original format.
What is the opposite / inverse of unnest_tokens? An answer to a similar question on this forum shows how to get text back to roughly its original form after processing it in tidied form, using nest from tidyr and map functions from purrr. Following that answer, I can do the following:
First, let's go from raw text to a tidied format.
library(tidyverse)
library(tidytext)
tidy_austen <- janeaustenr::austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, text)
tidy_austen
#> # A tibble: 725,055 x 3
#>                   book linenumber        word
#>                 <fctr>      <int>       <chr>
#>  1 Sense & Sensibility          1       sense
#>  2 Sense & Sensibility          1         and
#>  3 Sense & Sensibility          1 sensibility
#>  4 Sense & Sensibility          3          by
#>  5 Sense & Sensibility          3        jane
#>  6 Sense & Sensibility          3      austen
#>  7 Sense & Sensibility          5        1811
#>  8 Sense & Sensibility         10     chapter
#>  9 Sense & Sensibility         10           1
#> 10 Sense & Sensibility         13         the
#> # ... with 725,045 more rows
The text is tidy now! But we can untidy it, back to something sort of like its original form. I typically approach this using nest from tidyr, and then some map functions from purrr.
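nested_austen <- tidy_austen %>%
  # tidyr >= 1.0 expects a named argument here; older versions accepted nest(word)
  nest(data = c(word)) %>%
  # flatten each nested tibble to a character vector, then paste the words together
  mutate(text = map(data, unlist),
         text = map_chr(text, paste, collapse = " "))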
nested_austen
#> # A tibble: 62,272 x 4
#>                   book linenumber              data
#>                 <fctr>      <int>            <list>
#>  1 Sense & Sensibility          1  <tibble [3 x 1]>
#>  2 Sense & Sensibility          3  <tibble [3 x 1]>
#>  3 Sense & Sensibility          5  <tibble [1 x 1]>
#>  4 Sense & Sensibility         10  <tibble [2 x 1]>
#>  5 Sense & Sensibility         13 <tibble [12 x 1]>
#>  6 Sense & Sensibility         14 <tibble [13 x 1]>
#>  7 Sense & Sensibility         15 <tibble [11 x 1]>
#>  8 Sense & Sensibility         16 <tibble [12 x 1]>
#>  9 Sense & Sensibility         17 <tibble [11 x 1]>
#> 10 Sense & Sensibility         18 <tibble [15 x 1]>
#> # ... with 62,262 more rows, and 1 more variables: text <chr>
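For comparison, here is a minimal sketch of the same round trip without nest, assuming a small made-up data frame (the id column and the example sentences are hypothetical, not my real data). It removes stop words with anti_join and then pastes the remaining words back together per document with summarise. Note that a document consisting only of stop words would disappear from the result.

```r
library(dplyr)
library(tidytext)
library(stringr)

# Toy data standing in for the real data frame (hypothetical example text)
raw <- tibble(
  id   = 1:2,
  text = c("The cat sat on the mat", "A dog is in the garden")
)

untidied <- raw %>%
  unnest_tokens(word, text) %>%            # one row per lowercased word
  anti_join(stop_words, by = "word") %>%   # drop stop words from tidytext's stop_words
  group_by(id) %>%
  # collapse the remaining words back into one string per document
  summarise(text = str_c(word, collapse = " "), .groups = "drop")

untidied
```

This only restores the word order within each document; the original case and punctuation are lost at the unnest_tokens step.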
Can someone please help me adapt the code above for the case where I tokenize into n-grams, with n equal to 2 or 3?
What I am trying to do is:
Step 1: Split the text into trigrams
Step 2: View the trigrams and decide which ones make sense (I need to check these manually, and I will replace only those that make sense to me)
Step 3: Replace these trigrams in the original text with a single word joined by _
Step 4: Repeat the above for bigrams
Step 5: Tokenize again
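The five steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the docs data frame, the hand-picked n-grams, and the join_ngrams helper are all hypothetical names made up for the example. Because overlapping n-grams cannot simply be pasted back together, the sketch replaces the chosen phrases in the original (lowercased) text with stringr instead of untidying the n-gram table.

```r
library(dplyr)
library(tidytext)
library(stringr)

# Hypothetical toy text; replace with your own data frame
docs <- tibble(
  id   = 1:2,
  text = c("I moved to New York City last year",
           "New York City is far from San Francisco")
)

# Hypothetical helper: swap each chosen n-gram for an underscore-joined token
join_ngrams <- function(text, ngrams) {
  if (length(ngrams) == 0) return(text)
  # named vector for str_replace_all: names are regex patterns, values are replacements
  replacements <- setNames(str_replace_all(ngrams, " ", "_"),
                           str_c("\\b", ngrams, "\\b"))
  str_replace_all(text, replacements)
}

# Step 1: split into trigrams and count them for manual review
trigram_counts <- docs %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  count(trigram, sort = TRUE)

# Step 2: after inspecting trigram_counts, list the trigrams to keep by hand
keep_trigrams <- c("new york city")

# Step 3: replace them in the lowercased original text
docs_joined <- docs %>%
  mutate(text = join_ngrams(str_to_lower(text), keep_trigrams))

# Step 4: repeat for bigrams on the updated text
bigram_counts <- docs_joined %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)
keep_bigrams <- c("san francisco")
docs_joined <- docs_joined %>%
  mutate(text = join_ngrams(text, keep_bigrams))

# Step 5: tokenize again; the default word tokenizer should keep
# underscore-joined phrases together as single tokens
tidy_docs <- docs_joined %>%
  unnest_tokens(word, text)
```

One caveat with this sketch: unnest_tokens strips punctuation when building n-grams, so a phrase that spans a comma or line break in the original text will show up in the counts but will not be matched by the plain string replacement.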