I'm working on a text mining task and have run into a problem with stemming. I have several paragraphs in the format below. They are character objects, neither lists nor Corpus objects from the tm package.

[1] " andres oppenheimer intelectuales influyentes latinoamerica segun revista foreign policy editor columnista miami herald sigue recorriendo continente presenta reportajes cnn tradicional ciclo periodistico argentina presentando libro salvese pueda analiza futuro mundo automatizacion robotizacion "

I have a dictionary (lexicon) whose words have to be matched against the text above. The problem is that I couldn't do it through stemming. My code is the following:

    lexicon <- read.xlsx("lexicon nf.xlsx", sheetName = "lex", colIndex = 1, header = TRUE)
    lexicon$palabra <- as.character(lexicon$palabra)
    match <- paste(lexicon$palabra[order(-nchar(lexicon$palabra))], collapse = "|^")

If I try:

    match <- paste(lexicon$palabra[order(-nchar(lexicon$palabra))], collapse = "|")

It matches the lexicon entries at any position within a word, which is not what I want. I know that if I split the text into words (for instance, on spaces), I can match the way I need, but this is a more complicated approach. I would like to do it directly on the paragraph, without turning it into a Corpus object.
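For illustration, the split-based approach mentioned above might look like the following minimal sketch. Here `texto` and `palabras` are hypothetical stand-ins for the real paragraph and for `lexicon$palabra`, and the three lexicon entries are invented for the example.

```r
# Hypothetical stand-ins for the real paragraph and for lexicon$palabra
texto <- " andres oppenheimer intelectuales influyentes latinoamerica presentando libro "
palabras <- c("present", "influy", "robot")

# Split the paragraph on whitespace into individual words
tokens <- strsplit(trimws(texto), "\\s+")[[1]]

# Build one alternation (longest entries first) and anchor the whole group
# at the start of each token, so entries only match as word beginnings
patron <- paste0("^(", paste(palabras[order(-nchar(palabras))], collapse = "|"), ")")

# Keep the tokens that begin with any lexicon entry
coincidencias <- tokens[grepl(patron, tokens)]
print(coincidencias)
```

Wrapping the alternation in parentheses makes the single `^` apply to every entry, whereas collapsing with `"|^"` leaves the first entry unanchored.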

Any idea? Thank you very much for your help!

Dave2e
pch919
  • It's very unclear to me from this description what you are doing. I don't see where you are doing stemming at all here. Make sure to provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and the desired output. Why is it so important to avoid the `tm` package? – MrFlick Aug 20 '18 at 20:14
  • match: I pasted the vector "palabra" of the data.frame "lexicon", ordered by number of characters, and collapsed it with "|^". I used "^" because each lexicon entry should match at the beginning of any word in the text (the root of the word). I couldn't get a match in this paragraph with "|^". If the only way to do it is with the tm library, well, let's do it. – pch919 Aug 21 '18 at 01:15
  • You could use udpipe (https://cran.r-project.org/web/packages/udpipe/index.html) to do tokenisation & lemmatisation for Spanish and next do a simple join with your lexicon. –  Aug 22 '18 at 19:57

0 Answers