I am trying to make use of Python's NLTK package from within R using the reticulate package. For the most part, I have been successful.
Now, I would like to perform named entity recognition (i.e., to determine which tokens represent named entities and what type of named entity they represent) using NLTK's ne_chunk() function. My problem is that the function returns an object of class nltk.tree.Tree, which I cannot figure out how to parse in R.
If ne_chunk() is fed up to ten token-tag pairs, it returns a result that can be converted to a character string using as.character(), which can then be parsed with regular expression functions (this is just a hack, and I am not satisfied with it). With more than ten pairs, however, it returns a shorthand representation of the tree, from which no meaningful data can be extracted using R methods.
Here is a minimal reproducible example:
library(reticulate)

nltk <- import("nltk")

# Thin R wrappers around the NLTK functions used below
sent_tokenize <- function(text, language = "english") {
  nltk$tokenize$sent_tokenize(text, language)
}

word_tokenize <- function(text, language = "english", preserve_line = FALSE) {
  nltk$tokenize$word_tokenize(text, language, preserve_line)
}

pos_tag <- function(tokens, tagset = NULL, language = "eng") {
  nltk$pos_tag(tokens, tagset, language)
}

ne_chunk <- function(tagged_tokens, binary = FALSE) {
  nltk$ne_chunk(tagged_tokens, binary)
}

text <- "Christopher is having a difficult time parsing NLTK Trees in R."

# Tokenize, POS-tag, then chunk named entities
tokens <- word_tokenize(text)
tagged_tokens <- pos_tag(tokens)
ne_tagged_tokens <- ne_chunk(tagged_tokens)
Here is the shorthand that is returned when the text from the previous example is processed:
> ne_tagged_tokens
List (11 items)
Here are the classes to which ne_tagged_tokens belongs:
> class(ne_tagged_tokens)
[1] "nltk.tree.Tree" "python.builtin.list" "python.builtin.object"
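For context on the structure involved: an nltk.tree.Tree behaves like a Python list whose children are either (token, tag) pairs or labelled subtrees exposing label() and leaves(). The following is only a minimal sketch of how such a tree could be flattened on the Python side into plain tuples; the helper name flatten_chunks and the "O" marker for tokens outside any entity are illustrative assumptions, not part of NLTK.

```python
def flatten_chunks(tree):
    """Flatten an NLTK-style chunk tree into (token, pos_tag, entity_label) rows.

    Assumption: each child is either a (token, tag) pair or a subtree that
    exposes .label() and .leaves(), as nltk.tree.Tree does. Tokens that are
    not inside a subtree get the placeholder label "O" (an arbitrary choice).
    """
    rows = []
    for child in tree:
        if hasattr(child, "label"):
            # Subtree: every leaf inherits the subtree's entity label
            for token, tag in child.leaves():
                rows.append((token, tag, child.label()))
        else:
            # Plain (token, tag) pair outside any named-entity chunk
            token, tag = child
            rows.append((token, tag, "O"))
    return rows
```

If a helper like this were saved to a file and loaded with reticulate's source_python(), the result would presumably arrive in R as a list of plain tuples rather than as a Tree object, though I have not confirmed how the conversion behaves in this setup.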
I am not interested in suggestions to use alternative, pre-existing R packages.