
This is a rather lengthy one, so please bear with me; unfortunately, the error occurs right at the very end, when I try to predict on the unseen test set!

I would like to perform text classification with word embeddings (that I have trained on my data set) fed into a neural network. I simply have a column with textual descriptions as the input and four price classes as the target.

For a reproducible example, here are the necessary data set and the word embedding:

DF: https://www.dropbox.com/s/it0jsbv8e7nkryt/DF.csv?dl=0

WordEmb: https://www.dropbox.com/s/ia5fmio2e0plwkr/WordEmb.txt?dl=0

And here my code:

library(keras)   # provides to_categorical(), pad_sequences(), the layer functions, etc.

set.seed(2077)
DF <- read.delim("DF.csv", header = TRUE, sep = ",",
                 dec = ".", stringsAsFactors = FALSE)
DF <- DF[, -1]   # drop the first (index) column

# parameters
max_num_words = 9000         # vocabulary cap, chosen from the number of observations
validation_split = 0.3
embedding_dim = 300

##### Data Preparation #####

# split into training and test set

set.seed(2077)
n <- nrow(DF)
shuffled <- DF[sample(n),]

# Split the data in train and test
train <- shuffled[1:round(0.7 * n),]
test <- shuffled[(round(0.7 * n) + 1):n,]
rm(n, shuffled)

# predictor/target variable
x_train <- train$Description
x_test <- test$Description

y_train <- train$Price_class
y_test <- test$Price_class

### encode target variable ###

# One-hot encode training target values; Price_class is coded 1-4, so
# to_categorical() also creates an all-zero column for class 0, which we drop
trainLabels <- to_categorical(y_train)
trainLabels <- trainLabels[, 2:5]

# One-hot encode test target values
testLabels <- keras::to_categorical(y_test)
testLabels <- testLabels[, 2:5]
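
Equivalently (just a sketch of the same encoding, assuming Price_class takes exactly the values 1 to 4), the labels could be shifted to start at zero so that no column needs to be dropped:

trainLabels <- to_categorical(y_train - 1, num_classes = 4)
testLabels <- to_categorical(y_test - 1, num_classes = 4)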

### encode predictor variable ###

# set up the tokenizer
tokenizer <- text_tokenizer(num_words = max_num_words)

# finally, vectorize the text samples into a 2D integer tensor

set.seed(2077)
tokenizer %>% fit_text_tokenizer(x_train)   # fit the tokenizer on the training texts only
train_data <- texts_to_sequences(tokenizer, x_train)
# reuse the same fitted tokenizer for the test set (refitting on x_test would alter the word index)
test_data <- texts_to_sequences(tokenizer, x_test)

# determine the average document length -> use it to choose the maximal sequence length
seq_lengths <- sapply(train_data, length)
mean(seq_lengths)

max_sequence_length = 70

# This turns our lists of integers into a 2D integer tensor of shape (samples, maxlen)

x_train <- keras::pad_sequences(train_data, maxlen = max_sequence_length)
x_test <- keras::pad_sequences(test_data, maxlen = max_sequence_length)
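
As a quick sanity check, both padded matrices should now have shape (samples, 70), matching the model's input length:

dim(x_train)
dim(x_test)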

word_index <- tokenizer$word_index
Encoding(names(word_index)) <- "UTF-8"

#### PREPARE EMBEDDING MATRIX ####

# read the embedding file: each line holds a word followed by its coefficients
embeddings_index <- new.env(parent = emptyenv())
lines <- readLines("WordEmb.txt")
for (line in lines) {
  values <- strsplit(line, ' ', fixed = TRUE)[[1]]    
  word <- values[[1]]
  coefs <- as.numeric(values[-1])
  embeddings_index[[word]] <- coefs
}

embedding_matrix <- array(0, c(max_num_words, embedding_dim))
for (word in names(word_index)) {
  index <- word_index[[word]]
  if (index < max_num_words) {
    embedding_vector <- embeddings_index[[word]]
    if (!is.null(embedding_vector)) {
      # Keras reserves index 0 for padding, so word `index` goes
      # into row index + 1 of the 1-based R matrix
      embedding_matrix[index + 1, ] <- embedding_vector
    }
  }
}
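
As a rough coverage check (a sketch I would add here, not part of the original pipeline), you can count how many vocabulary rows actually received a pre-trained vector; rows that stayed all-zero had no match in WordEmb.txt:

covered <- sum(rowSums(embedding_matrix != 0) > 0)
covered / max_num_words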

##### Convolutional Neural Network #####

# load the pre-trained word embeddings into an embedding layer
# note that we set trainable = FALSE so as to keep the embeddings fixed
num_words <- min(max_num_words, length(word_index) + 1)

embedding_layer <- keras::layer_embedding(
  input_dim = num_words,
  output_dim = embedding_dim,
  weights = list(embedding_matrix), 
  input_length = max_sequence_length,
  trainable = FALSE
)

# train a 1D convnet with global maxpooling
sequence_input <- layer_input(shape = list(max_sequence_length), dtype='int32')

preds <- sequence_input %>%
  embedding_layer %>% 
  layer_conv_1d(filters = 128, kernel_size = 1, activation = 'relu') %>% 
  layer_max_pooling_1d(pool_size = 5) %>% 
  layer_conv_1d(filters = 128, kernel_size = 1, activation = 'relu') %>% 
  layer_max_pooling_1d(pool_size = 5) %>% 
  layer_conv_1d(filters = 128, kernel_size = 1, activation = 'relu') %>% 
  layer_max_pooling_1d(pool_size = 2) %>% 
  layer_flatten() %>% 
  layer_dense(units = 128, activation = 'relu') %>% 
  layer_dense(units = 4, activation = 'softmax')

model <- keras_model(sequence_input, preds)

model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = 'adam',
  metrics = c('acc')  
)

model %>% keras::fit(
  x_train,                              
  trainLabels,                          
  batch_size = 1024,
  epochs = 20,
  validation_split = 0.3
)

Now here is where I get stuck: I cannot use the trained model to predict on the unseen test set:

# Predict the classes for the test data
classes <- model %>% predict_classes(x_test, batch_size = 128)

I get this error: 
Error in py_get_attr_impl(x, name, silent) : 
  AttributeError: 'Model' object has no attribute 'predict_classes'
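
As pointed out in the comments below, predict_classes() is only available for sequential models; keras_model() returns a functional model, which is why the attribute is missing, and predict() has to be used instead. A minimal sketch of the workaround (the four probability columns map back to price classes 1 to 4):

# predict() returns a matrix of class probabilities for a functional model
probs <- model %>% predict(x_test, batch_size = 128)
classes <- apply(probs, 1, which.max)   # index of the most probable class (1-4)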

Afterwards, I'd proceed like this:

# Confusion matrix
table(y_test, classes)

# Evaluate on test data and labels
score <- model %>% evaluate(x_test, testLabels, batch_size = 128)

# Print the score
print(score)

For now the actual accuracy does not really matter, since this is only a small sample of my data set.

I know this is a long one, but any help would be very much appreciated.

  • Are you using `keras`? `AttributeError` makes it sound like the problem is from Keras, not R specifically, in which case this question seems to be about the same problem you're having: [AttributeError: 'Model' object has no attribute 'predict_classes'](https://stackoverflow.com/questions/44806125/attributeerror-model-object-has-no-attribute-predict-classes) – divibisan Mar 22 '19 at 16:05
  • Yes, I'm using keras. Thanks for the pointer - I just needed to use predict instead of predict_classes...but the accuracy is really bad, which is worrying me. But thanks a lot!! – MasterChief5773 Mar 22 '19 at 18:35
  • Possible duplicate of [AttributeError: 'Model' object has no attribute 'predict\_classes'](https://stackoverflow.com/questions/44806125/attributeerror-model-object-has-no-attribute-predict-classes) – divibisan Mar 22 '19 at 18:37
