R Lime package for text data

Question

I was exploring the use of R lime on text datasets to explain black-box model predictions and came across this example: https://cran.r-project.org/web/packages/lime/vignettes/Understanding_lime.html

I tested it on a restaurant review dataset, but the plot produced by plot_features does not show all the features. Could anyone explain why this happens, or recommend a different package to use? Any help is greatly appreciated, since little material on R lime can be found online. Thanks!

Dataset: https://drive.google.com/file/d/1-pzY7IQVyB_GmT5dT0yRx3hYzOFGrZSr/view?usp=sharing

# Importing the dataset
dataset_original = read.delim('Restaurant_Reviews.tsv', quote = '', stringsAsFactors = FALSE)

# Cleaning the texts
# install.packages('tm')
# install.packages('SnowballC')
library(tm)
library(SnowballC)
corpus = VCorpus(VectorSource(dataset_original$Review))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords())
corpus = tm_map(corpus, stemDocument)
corpus = tm_map(corpus, stripWhitespace)

# Creating the Bag of Words model
dtm = DocumentTermMatrix(corpus)
dtm = removeSparseTerms(dtm, 0.999)
dataset = as.data.frame(as.matrix(dtm))
dataset$Liked = dataset_original$Liked

# Encoding the target feature as factor
dataset$Liked = factor(dataset$Liked, levels = c(0, 1))

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Liked, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

library(caret)
model <- train(Liked~., data=training_set, method="xgbTree")

######
#LIME#
######
library(lime)
explainer <- lime(training_set, model)
explanation <- explain(test_set[1:4,], explainer, n_labels = 1, n_features = 5)
plot_features(explanation)

My undesired output: https://www.dropbox.com/s/pf9dq0kba0d5flt/Udemy_NLP_Lime.jpeg?dl=0

What I want (different dataset): https://www.dropbox.com/s/e1472i4yw1owmlc/DMT_A5_lime.jpeg?dl=0

Just a minor update: I still have not been able to solve it, but I am guessing the problem is due to the sparsity of the matrix. I still need help with this. — Lacri Mosa, Jun 09 '18 at 11:30

Sam S. · Answer 1 · 2018-11-09T03:07:20.257

I could not open the links you provided for the dataset and output, so I worked from the vignette you linked: https://cran.r-project.org/web/packages/lime/vignettes/Understanding_lime.html . As in the vignette, I used text2vec to build the document-term matrix and the xgboost package for classification, and it works for me. To display more features, you may need to increase the value of n_features in the explain function; see https://www.rdocumentation.org/packages/lime/versions/0.4.0/topics/explain .

library(lime)
library(xgboost)  # the classifier
library(text2vec) # used to build the BoW matrix

# load data
data(train_sentences, package = "lime")  # from lime 
data(test_sentences, package = "lime")   # from lime

# Tokenize data
get_matrix <- function(text) {
  it <- text2vec::itoken(text, progressbar = FALSE)

  # use the following lines instead if you want to prune the vocabulary
  # vocab <- create_vocabulary(it, ngram = c(1L, 1L)) %>%
  #   prune_vocabulary(term_count_min = 10, doc_proportion_max = 0.2)
  # vectorizer <- vocab_vectorizer(vocab)

  # hash_vectorizer has no option to prune the vocabulary, but it is very fast on big data
  vectorizer <- text2vec::hash_vectorizer(hash_size = 2 ^ 10, ngram = c(1L, 1L))
  text2vec::create_dtm(it, vectorizer = vectorizer)
}

# BoW matrix generation
# features should be the same for both dtm_train and dtm_test 
dtm_train <- get_matrix(train_sentences$text)
dtm_test  <- get_matrix(test_sentences$text) 

# xgboost for classification
param <- list(max_depth = 7, 
          eta = 0.1, 
          objective = "binary:logistic", 
          eval_metric = "error", 
          nthread = 1)

xgb_model <- xgboost::xgb.train(
  param,
  xgboost::xgb.DMatrix(dtm_train, label = train_sentences$class.text == "OWNX"),
  nrounds = 100
)

# prediction
predictions <- predict(xgb_model, dtm_test) > 0.5
test_labels <- test_sentences$class.text == "OWNX"

# Accuracy
print(mean(predictions == test_labels))

# what are the most important words for the predictions.
n_features <- 5 # number of features to display
sentence_to_explain <- head(test_sentences[test_labels,]$text, 6)
explainer <- lime::lime(sentence_to_explain, model = xgb_model, 
                    preprocess = get_matrix)
explanation <- lime::explain(sentence_to_explain, explainer, n_labels = 1, 
                         n_features = n_features)

# inspect columns 2 to 9 of the explanation table
explanation[, 2:9]

# plot
lime::plot_features(explanation)

When I apply your code to the train_sentences dataset, the following line creates NAs, because factor() turns every value that is not listed in levels into NA. Please check whether the same happens with your data.

dataset$Liked = factor(dataset$Liked, levels = c(0, 1))

Removing the levels argument, or using labels instead of levels, works for me.
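
A minimal sketch of that behaviour in base R, assuming a made-up label vector rather than either real dataset ("OTHER" is a placeholder label and `liked` is an invented 0/1 column):

```r
# Hypothetical labels for illustration; "OTHER" is a placeholder value.
labels_chr <- c("OWNX", "OTHER", "OWNX")

# None of the values match the levels 0 and 1, so every entry becomes NA.
f_bad <- factor(labels_chr, levels = c(0, 1))
print(sum(is.na(f_bad)))

# Dropping `levels` lets factor() infer the levels from the data.
f_ok <- factor(labels_chr)
print(levels(f_ok))

# For a genuine 0/1 column, `labels` only renames the levels,
# so no NAs are introduced.
liked <- c(0, 1, 1, 0)
f_lab <- factor(liked, levels = c(0, 1), labels = c("no", "yes"))
print(table(f_lab, useNA = "ifany"))
```

If the first count is greater than zero on the real target column, the levels passed to factor() do not match the values in the data.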

Please also check your data structure and make sure the matrix is not all zeros because of those NAs, and that it is not too sparse. Either may cause the problem, because lime then cannot find the top n features.
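
These checks could be sketched roughly as follows in base R. The data frame `dtm_df` and the vector `target` are toy stand-ins with invented values, not the restaurant data; on the real data they would correspond to the term columns of `dataset` and to `dataset$Liked`.

```r
# Toy document-term counts standing in for the real data (made-up values).
dtm_df <- data.frame(
  good  = c(1, 0, 0, 2),
  bad   = c(0, 0, 1, 0),
  staff = c(0, 0, 0, 1)
)

# Toy target column containing one NA.
target <- factor(c(1, 0, NA, 1), levels = c(0, 1))
n_na <- sum(is.na(target))            # number of missing labels

m <- as.matrix(dtm_df)
sparsity    <- mean(m == 0)           # share of cells that are zero
empty_docs  <- sum(rowSums(m) == 0)   # documents with no remaining terms
empty_terms <- sum(colSums(m) == 0)   # terms that never occur

print(c(n_na = n_na, sparsity = sparsity,
        empty_docs = empty_docs, empty_terms = empty_terms))
```

Documents whose row is entirely zero give the explainer nothing to attribute the prediction to, so they are worth inspecting first.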

Thanks for your input. However, the reason why I did not use the text2vec method with xgboost is I did not want to use the hashing trick, and I wanted to try using other packages on 'train' (e.g. randomforest, knn etc). I still have not been able to solve this using packages other than xgboost for this one. — Lacri Mosa, Nov 07 '18 at 05:28
Thanks for your very informative question and comment. The text2vec and tm packages are for creating the document-term matrix, i.e. dtm_train and dtm_test. After that, we can apply any classification technique (such as the methods listed at https://rdrr.io/cran/caret/man/models.html ) through the caret package, as you did in your post, e.g. model <- train(dtm_train, train_sentences$class.text, method="svmLinear") . — Sam S., Nov 08 '18 at 12:49
Please see the below link and a solution with tm package: https://stackoverflow.com/questions/51296577/r-explain-on-lime-feature-names-stored-in-object-and-newdata-are-different . — Sam S., Nov 09 '18 at 03:14
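
When the training and test matrices are built separately, their columns have to match before a model or explainer can use both. The following base R sketch shows one way to align them; `align_columns` is a hypothetical helper written for illustration (it is not part of lime, tm, or text2vec and is not taken from the linked answer), and the matrices are toy data.

```r
# Hypothetical helper: rebuild a test matrix so that it has exactly the
# training terms, in the training order. Terms missing from the test data
# are filled with 0; test-only terms are dropped.
align_columns <- function(test_m, train_terms) {
  out <- matrix(0, nrow = nrow(test_m), ncol = length(train_terms),
                dimnames = list(rownames(test_m), train_terms))
  shared <- intersect(colnames(test_m), train_terms)
  out[, shared] <- test_m[, shared, drop = FALSE]
  out
}

# Toy example: "slow" was never seen in training, "bad" and "staff"
# do not occur in the test documents.
train_terms <- c("good", "bad", "staff")
test_m <- matrix(c(1, 0, 2, 1), nrow = 2,
                 dimnames = list(NULL, c("good", "slow")))

aligned <- align_columns(test_m, train_terms)
print(aligned)
```

With tm, the same idea applies after as.matrix(dtm); with text2vec, reusing a single vectorizer for both sets avoids the mismatch in the first place.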
