I am using the function keywords_rake from the udpipe package (for R) to extract keywords from a bunch of documents.

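library(udpipe)

# dl is assumed to come from udpipe_download_model(); dl$file_model is the model file path
dl <- udpipe_download_model(language = "english")
udmodel_en <- udpipe_load_model(file = dl$file_model)
x <- udpipe_annotate(udmodel_en, x = data$text)
x <- as.data.frame(x)

keywords <- keywords_rake(x = x, term = "lemma", group = "doc_id", 
                          relevant = x$xpos %in% c("NN", "JJ"), ngram_max = 2)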

where data looks like this:

  text
  "cats are nice but dogs are better..."
  "I really like dogs..."
  "red flowers are pretty, especially roses..."
  "once I saw a blue whale ..."
  ....

(each row is a separate document)

However, the output does not say which document each keyword came from; it is a single list of keywords across all the documents.

How can I link these keywords to the documents they were taken from, i.e. get a list of keywords for each document?

Something like this:

      keywords
doc1   dog, cat, blue whale
doc2   dog 
doc3   red flower, tower, Donald Trump 
Carbo

1 Answer

You can use txt_recode_ngram together with the output of keywords_rake to do this. txt_recode_ngram writes the multi-word keywords back into the original token-level data.frame, so every keyword stays on a row that still carries its doc_id, and you can then select what you need. See the example below, which uses the dataset supplied with udpipe.

Disclaimer: Code copied from jwijffels' answer in issue 41 on the GitHub page of udpipe.

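library(udpipe)

# Annotated example data shipped with udpipe; keep only the Dutch reviews
data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, language == "nl")
# sep = "-" joins the words of a multi-word keyword with a hyphen
keywords <- keywords_rake(x = x, term = "lemma", group = "doc_id", 
                          relevant = x$xpos %in% c("NN", "JJ"), sep = "-")
head(keywords)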

            keyword ngram freq     rake
1  openbaar-vervoer     2   19 2.391304
2         heel-fijn     2    2 2.236190
3  heel-vriendelijk     2    3 2.131092
4 herhaling-vatbaar     2    6 2.000000
5  heel-appartement     2    2 1.935450
6 steenworp-afstand     2    4 1.888889

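# Recode consecutive lemmas that form a keyword into one hyphenated term
x$term <- txt_recode_ngram(x$lemma, compound = keywords$keyword, ngram = keywords$ngram, sep = "-")
# Keep only terms that are RAKE keywords; everything else becomes NA
x$term <- ifelse(!x$term %in% keywords$keyword, NA, x$term)
head(x[!is.na(x$term), ])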

        doc_id language sentence_id token_id       token      lemma xpos                term
67039 19991431       nl        4379       11         erg        erg   JJ        erg-centraal
67048 19991431       nl        4379       20        leuk       leuk   JJ          leuk-adres
67070 21054450       nl        4380        6       goede       goed   JJ        goed-locatie
67077 21054450       nl        4380       13    Europese   europees   JJ       europees-wijk
67272 23542577       nl        4393       84 uitstekende uitstekend   JJ uitstekend-gastheer
67299 40676307       nl        4396       25   gezellige   gezellig   JJ      gezellig-buurt
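
To get from this token-level table to the one-row-per-document layout asked for in the question, the non-NA terms can be collapsed per doc_id. The following is only a sketch in base R: the small data.frame is a made-up stand-in for x after the txt_recode_ngram step (only the doc_id and term columns matter), and keywords_by_doc is a name chosen here for illustration.

```r
# Hypothetical stand-in for the annotated data.frame after the recode step:
# one row per token, term is NA where the token is not part of a keyword.
x <- data.frame(
  doc_id = c("doc1", "doc1", "doc1", "doc1", "doc2", "doc2", "doc3"),
  term   = c("dog", NA, "blue-whale", "dog", "dog", NA, "red-flower"),
  stringsAsFactors = FALSE
)

# Drop the tokens that are not keywords.
hits <- x[!is.na(x$term), c("doc_id", "term")]

# Group the keywords by document, then paste the unique ones together.
by_doc <- split(hits$term, hits$doc_id)
keywords_by_doc <- data.frame(
  doc_id   = names(by_doc),
  keywords = vapply(by_doc,
                    function(k) paste(unique(k), collapse = ", "),
                    character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

print(keywords_by_doc)
```

With the real data, the same split and paste steps would be applied to the x built above. Documents in which no keyword was found do not appear in the result, because their rows are dropped together with the NA terms.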
phiver