I am using the R package udpipe to extract keywords from my data frame. Let's start with some data contained in the package:
library(udpipe)
data(brussels_reviews)
If we look at the structure, we see it contains 1500 comments (rows) and 4 columns.
str(brussels_reviews)
'data.frame': 1500 obs. of 4 variables:
$ id : int 32198807 12919832 23786310 20048068 17571798 28394425 46322841 27719650 14512388 37675819 ...
$ listing_id: int 1291276 1274584 1991750 2576349 1866754 5247223 7925019 4442255 2863621 3117760 ...
$ feedback : chr "Gwen fue una magnifica anfitriona. El motivo de mi viaje a Bruselas era la busqueda de un apartamento y Gwen me"| __truncated__ "Aurelie fue muy atenta y comunicativa. Nos dio mapas, concejos turisticos y de transporte para disfrutar Brusel"| __truncated__ "La estancia fue muy agradable. Gabriel es muy atento y esta dispuesto a ayudar en todo lo que necesites. La cas"| __truncated__ "Excelente espacio, excelente anfitriona, un lugar accessible economicamente y cerca de los lugares turisticos s"| __truncated__ ...
$ language : chr "es" "es" "es" "es" ...
Following this tutorial, I can extract keywords from the whole data frame at once. Excellent.
However, I need to extract keywords for every row, not for the data frame as a whole.
I acknowledge that this does not make much sense for this example, since it has only a single text column (feedback). However, my real data has many text columns.
So for this example I would like to get 1500 groups of keywords, one group per row.
How can I do it?
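To make the multi-column case concrete, here is a minimal sketch of how several text columns might be collapsed into one string per row before annotation. The data frame `reviews` and its columns `feedback` and `notes` are hypothetical stand-ins for my real data, not part of `brussels_reviews`.

```r
# Toy data frame with two hypothetical text columns (names are illustrative only)
reviews <- data.frame(
  id       = c(101, 102),
  feedback = c("Great host and a clean flat", "Noisy street at night"),
  notes    = c("Close to the metro", NA),
  stringsAsFactors = FALSE
)

text_cols <- c("feedback", "notes")

# Replace missing values with empty strings so paste() does not emit "NA"
filled <- lapply(reviews[text_cols], function(col) ifelse(is.na(col), "", col))

# One combined string per row, in the original row order
reviews$text <- trimws(do.call(paste, c(filled, sep = " ")))

print(reviews$text)
```

The combined column could then presumably be passed to `udpipe_annotate()` together with its `doc_id` argument (for example `doc_id = reviews$id`), so that each annotated token can be traced back to its row.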
UPDATE with an EXAMPLE
Following these two steps, we get the keywords for the whole data frame. However, I would like to get the keywords for every row of the data frame.
First step
library(udpipe)
library(textrank)
## First step: Take the Spanish udpipe model and annotate the text. Note: this takes about 3 minutes
data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "es")
ud_model <- udpipe_download_model(language = "spanish")
ud_model <- udpipe_load_model(ud_model$file_model)
x <- udpipe_annotate(ud_model, x = comments$feedback)
x <- as.data.frame(x)
Second step
## Collocation (words following one another)
stats <- keywords_collocation(x = x,
                              term = "token", group = c("doc_id", "paragraph_id", "sentence_id"),
                              ngram_max = 4)
## Co-occurrences: How frequent do words occur in the same sentence, in this case only nouns or adjectives
stats <- cooccurrence(x = subset(x, upos %in% c("NOUN", "ADJ")),
                      term = "lemma", group = c("doc_id", "paragraph_id", "sentence_id"))
## Co-occurrences: How frequent do words follow one another
stats <- cooccurrence(x = x$lemma,
                      relevant = x$upos %in% c("NOUN", "ADJ"))
## Co-occurrences: How frequent do words follow one another even if we would skip 2 words in between
stats <- cooccurrence(x = x$lemma,
                      relevant = x$upos %in% c("NOUN", "ADJ"), skipgram = 2)
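To illustrate the output I am after, here is a rough sketch in base R on a tiny hand-made table. It assumes the annotated data frame has the columns `doc_id`, `lemma` and `upos`, as the code above uses them, and that `doc_id` identifies the original row. The toy values in `x_toy` are invented, and counting lemmas per document is only a simple stand-in for a real keyword method.

```r
# Hand-made stand-in for the annotated data frame (values are invented)
x_toy <- data.frame(
  doc_id = c("doc1", "doc1", "doc1", "doc1", "doc2", "doc2", "doc2"),
  lemma  = c("casa", "ser", "bonito", "casa", "barrio", "ser", "tranquilo"),
  upos   = c("NOUN", "AUX", "ADJ", "NOUN", "NOUN", "AUX", "ADJ"),
  stringsAsFactors = FALSE
)

# Keep only nouns and adjectives, as in the co-occurrence examples
relevant <- subset(x_toy, upos %in% c("NOUN", "ADJ"))

# Count each lemma within each document
counts <- as.data.frame(table(doc_id = relevant$doc_id, lemma = relevant$lemma),
                        stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]

# Most frequent lemmas first within each document
counts <- counts[order(counts$doc_id, -counts$Freq), ]

# One group of keywords per document, i.e. per original row
keywords_by_doc <- split(counts$lemma, counts$doc_id)
print(keywords_by_doc)
```

With the real data, the same grouping would be applied to `x` instead of `x_toy`, ideally with a proper keyword method in place of the plain counts, giving one list element per review.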