
I am using the udpipe package in R to do some text mining. I have followed this tutorial: https://cran.r-project.org/web/packages/udpipe/vignettes/udpipe-usecase-postagging-lemmatisation.html#nouns__adjectives_used_in_same_sentence but now I am a bit stuck.

Indeed, I would like to group more than two words, so that I can identify, for example, key expressions like "from dusk till dawn".

So I was wondering whether, based on the graph in the tutorial above, it is possible to apply some kind of clustering algorithm to "merge" the words that are strongly, and frequently, linked together? If yes, how?

Is there another way to do that?

Thanks

MysteryGuy
  • `n-grams` is the keyword you're looking for – moodymudskipper Mar 26 '18 at 20:20
  • Did you look at the `ego()` and `cliques()` functions from the `igraph` package? Try `cliques(wordnetwork, min = 2, max = NULL)` and `ego(wordnetwork)`. Are the results what you expect? – nghauran Mar 27 '18 at 09:51
  • @Moody_Mudskipper are you sure that the `n-gram` package (I guess that is what you meant) can extract "key expressions" of any length from a corpus? – MysteryGuy Mar 27 '18 at 16:33
  • @ANG Thanks! I guess this function basically finds the cliques in the graph of connected words. However, do I really need cliques, or rather connected components (i.e. in the key expression `proof of concept` there are three words; `proof` is directly next to `of` and `of` is next to `concept`, but the three are not all next to each other...)? Or maybe should I use a larger "window"? – MysteryGuy Mar 27 '18 at 16:39
  • I wasn't referring to a specific package, just that the sets of words you're looking for (words that are found one after another) are called n-grams; associations (answer below) are something else: words that are found together in the items of your corpus. – moodymudskipper Mar 27 '18 at 16:45
  • This should give you pointers: https://www.tidytextmining.com/ngrams.html. I'd be happy to answer the question, but not without a reproducible example and expected output :). – moodymudskipper Mar 27 '18 at 16:47
  • If I got you right, you should be interested in ego-networks. In your example of a network such that `proof -> of` and `of -> concept`, the 2-level ego-network of `proof` will contain `of` and `concept` even if `proof` and `concept` are not directly connected – nghauran Mar 27 '18 at 16:57
  • @Moody_Mudskipper Okay, I'll look at it. Actually, my biggest issue is being able to extract "key expressions" of more than two words (I guess it is due to the maths behind that...). I am not sure that `rake` and `tidytextmining` really solve the problem – MysteryGuy Mar 27 '18 at 16:58
  • @ANG Cool, I will have a look at that when working on it. Thanks :-) – MysteryGuy Mar 27 '18 at 16:59
  • The link I gave you shows an example with bigrams; you can use the exact same example, set n to 4 instead of 2, and you'll have 4-grams... – moodymudskipper Mar 27 '18 at 17:06
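
A minimal sketch of the n-gram route suggested in the comments above, using tidytext's `unnest_tokens()` on the same `brussels_reviews` data as the answer below (the choice of tidytext and the setting `n = 4` are illustrative assumptions, not something posted in the thread):

library(udpipe)
library(dplyr)
library(tidytext)

data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "es")

# split each review into overlapping 4-grams and count them
fourgrams <- comments %>%
  select(id, feedback) %>%
  unnest_tokens(ngram, feedback, token = "ngrams", n = 4) %>%
  count(ngram, sort = TRUE)

head(fourgrams)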

1 Answer


Here are two options (using ego-networks and community detection) based on the tutorial you provided.

library(udpipe)
data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "es")

# download and load the Spanish UDPipe model, then tokenise, POS-tag and lemmatise the reviews
ud_model <- udpipe_download_model(language = "spanish")
ud_model <- udpipe_load_model(ud_model$file_model)
x <- udpipe_annotate(ud_model, x = comments$feedback, doc_id = comments$id)
x <- as.data.frame(x)


# co-occurrence of nouns and adjectives within the same sentence
cooc <- cooccurrence(x = subset(x, upos %in% c("NOUN", "ADJ")), 
                     term = "lemma", 
                     group = c("doc_id", "paragraph_id", "sentence_id"))
head(cooc)

library(igraph)
library(ggraph)
library(ggplot2)

# keep the 30 strongest co-occurrences, turn them into a graph and plot it
wordnetwork <- head(cooc, 30)
wordnetwork <- graph_from_data_frame(wordnetwork)
ggraph(wordnetwork, layout = "fr") +
        geom_edge_link(aes(width = cooc, edge_alpha = cooc), edge_colour = "pink") +
        geom_node_text(aes(label = name), col = "darkgreen", size = 4) +
        theme_graph(base_family = "Arial Narrow") +
        theme(legend.position = "none") +
        labs(title = "Cooccurrences within sentence", subtitle = "Nouns & Adjective")


### Option 1: using ego-networks
V(wordnetwork)                          # the graph has 23 vertices
ego(wordnetwork, order = 2)             # order-2 ego-network of each vertex
ego(wordnetwork, order = 1, nodes = 10) # order-1 ego-network of the 10th vertex ("publico")
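
If you want the word groups themselves rather than igraph vertex sequences, here is a minimal sketch (assuming the graph built above) that converts each ego-network to its vertex names with `as_ids()`:

# turn each ego-network into a character vector of words
ego_nets <- ego(wordnetwork, order = 2)
word_groups <- lapply(ego_nets, as_ids)
names(word_groups) <- as_ids(V(wordnetwork))
word_groups[["publico"]] # all words within two steps of "publico"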


### Option 2: using community detection

# Community structure detection based on edge betweenness (http://igraph.org/r/doc/cluster_edge_betweenness.html)
cluster_edge_betweenness(wordnetwork, weights = E(wordnetwork)$cooc)

# Community detection via random walks (http://igraph.org/r/doc/cluster_walktrap.html)
cluster_walktrap(wordnetwork, weights = E(wordnetwork)$cooc, steps = 2)

# Community detection via optimization of modularity score
# This works for undirected graphs only
wordnetwork2 <- as.undirected(wordnetwork) # an undirected graph
cluster_fast_greedy(wordnetwork2, weights = E(wordnetwork2)$cooc)

# Note that you can plot community object
comm <- cluster_fast_greedy(wordnetwork2, weights = E(wordnetwork2)$cooc)
plot_dendrogram(comm)

[Plot: dendrogram of the communities detected by cluster_fast_greedy]
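
To recover the actual word clusters (the groups of strongly linked words you want to merge), a minimal sketch using igraph's `membership()` and `communities()` accessors on the community object:

# words grouped by detected community
membership(comm)  # community id of each word
communities(comm) # list of word vectors, one per community
split(names(membership(comm)), membership(comm)) # same grouping, base-R style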

nghauran
  • Although I think the person asking the question is only interested in the functions keywords_rake / keywords_collocation / keywords_phrases / textrank_keywords and should have a closer look at the arguments of these functions, I really like this approach of word clustering. Very interesting use case, thanks for sharing! –  Mar 27 '18 at 19:43
  • @jwijffels I took a look at keywords_phrases: https://rdrr.io/cran/udpipe/man/keywords_phrases.html but it seems a bit "heavy" to configure the pattern(s)... Is there a simpler way? – MysteryGuy Mar 28 '18 at 13:45
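
For reference, the udpipe keyword helpers mentioned in the comment above can extract multi-word expressions directly from the annotated data frame `x` built in the answer; a minimal sketch with illustrative settings (`ngram_max = 4` is an assumption, not a recommendation from the thread):

library(udpipe)

# RAKE keywords over nouns and adjectives, allowing expressions of up to 4 words
rake <- keywords_rake(x = x, term = "lemma", group = "doc_id",
                      relevant = x$upos %in% c("NOUN", "ADJ"),
                      ngram_max = 4)
head(rake)

# collocations: words following one another more often than expected by chance
colloc <- keywords_collocation(x = x, term = "lemma",
                               group = c("doc_id", "sentence_id"),
                               ngram_max = 4)
head(colloc)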