
I'm trying to migrate a script from using tm to quanteda. Reading the quanteda documentation, I see there is a philosophy of applying changes "downstream" so that the original corpus is unchanged. OK.

I previously wrote a script to find spelling mistakes in our tm corpus and had support from our team to create a manual lookup. So, I have a CSV file with 2 columns: the first column is the misspelt term and the second column is the correct version of that term.
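
For reference, loading that lookup might look something like this (the file name here is just an assumption, not from my actual script):

spellingdoc <- read.csv("my_spelling_lookup.csv", stringsAsFactors = FALSE)  # col 1 = misspelling, col 2 = correction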

Using tm package previously I did this:

# Write a custom function to pass to tm_map
# "spellingdoc" is the 2-column CSV lookup (misspelling, correction)
library(stringr)
library(stringi)
library(tm)
stringi_spelling_update <- content_transformer(function(x, lut = spellingdoc) {
    stri_replace_all_regex(str = x,
                           pattern = paste0("\\b", lut[, 1], "\\b"),
                           replacement = lut[, 2],
                           vectorize_all = FALSE)
})

Then within my tm corpus transformations I did this:

mycorpus <- tm_map(mycorpus, function(i) stringi_spelling_update(i, spellingdoc))

What is the equivalent way to apply this custom function to my quanteda corpus?

Doug Fir

2 Answers


It's impossible to know whether this will work from your example, which leaves some parts out, but generally:

If you want to access texts in a quanteda corpus, you can use texts(), and to replace those texts, texts()<-.

So in your case, assuming that mycorpus is a tm corpus, you could do this:

library("quanteda")
stringi_spelling_update2 <- function(x, lut = spellingdoc) {
    stringi::stri_replace_all_regex(str = x, 
                                    pattern = paste0("\\b", lut[,1], "\\b"), 
                                    replacement = lut[,2], 
                                    vectorize_all = FALSE)
}

myquantedacorpus <- corpus(mycorpus)
texts(myquantedacorpus) <- stringi_spelling_update2(texts(myquantedacorpus), spellingdoc)
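
As a quick sanity check, you could then confirm that a known misspelling no longer turns up (the misspelled token here is just a placeholder):

kwic(myquantedacorpus, "recieve")  # should return no matches after the update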
Ken Benoit
  • Hi @Ken, actually mycorpus is a quanteda corpus. I'm just learning about the package recently. I think your second sentence is what I was looking for. However, for this particular problem I noticed the dictionary functionality you provide for dfm(), so I used that instead (sketched after these comments). Good to know that if I need to apply a custom function to each doc I can go ```texts(mycorpus) <- myCustomFunction(mycorpus)``` (though I should avoid that if sticking to the quanteda philosophy of not changing the corpus) – Doug Fir Aug 30 '17 at 16:16
  • Cleaning text in a corpus is still consistent with the non-destructive workflow principles of **quanteda**, if the corpus contains spelling mistakes that you are never interested in (such as the product of OCR errors). What we want to discourage is people applying stemmers or removing stopwords from the corpus itself. – Ken Benoit Aug 30 '17 at 16:34
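
For completeness, a rough sketch of the dfm()/dictionary route mentioned in the comment above, assuming spellingdoc holds misspellings in column 1 and corrections in column 2 (the dictionary keys become the corrected features; exclusive = FALSE leaves all other features untouched):

library("quanteda")
# keys = corrected terms, values = the misspellings they should absorb
spell_dict <- dictionary(setNames(as.list(spellingdoc[, 1]), spellingdoc[, 2]))
mydfm <- dfm(myquantedacorpus)
mydfm_fixed <- dfm_lookup(mydfm, spell_dict, exclusive = FALSE)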

I think I found an indirect answer over here.

texts(myCorpus) <- myFunction(texts(myCorpus))
Doug Fir