
I am implementing an LSTM model that I have already trained on a dataset. When I use a new dataset to predict the output, I get errors because some words in the new dataset are not present in the trained model's vocabulary. Is there any way to make it skip a word when that word is not found?

The words from the trained model are saved in a dictionary, as shown in my code below:

import pandas as pd

df = pd.read_csv('C:/Users/User/Desktop/Coding/lstm emotion recognition/emotion.data/emotion.data')


#Preparing data for model training
#Tokenization - since the data is already tokenized and lowercased, we just need to split on spaces
input_sentences = [text.split(" ") for text in df["text"].values.tolist()]
labels = df["emotions"].values.tolist()

#creating vocabulary(word index)
#Initialize word2id and label2id dictionaries that will be used to encode words and labels
word2id = dict() #creating the dictionary named word2id
label2id = dict() #creating a dictionary named label2id

max_words = 0 #maximum number of words in a sentence

#construction of word2id
for sentence in input_sentences:
    for word in sentence:
        #Add words to word2id if not exist
        if word not in word2id:
            word2id[word] = len(word2id)
    #If length of the sentence is greater than max_words, update max_words
    if len(sentence) > max_words:
        max_words = len(sentence)

#Construction of label2id and id2label dictionaries
label2id = {l: i for i, l in enumerate(set(labels))}
id2label = {v: k for k, v in label2id.items()}

from keras.models import load_model

model = load_model('modelsave2.py')
print(model)

import keras
# note: 'outputs' (plural) is the correct keyword for the functional Model API
model_with_attentions = keras.Model(inputs=model.input,
                                    outputs=[model.output,
                                             model.get_layer('attention_vec').output])
import json
import re
import string
# assuming remove_stopwords comes from gensim; adjust the import if you use a different helper
from gensim.parsing.preprocessing import remove_stopwords

#File I/O: open and read data from the JSON file
with open('C:/Users/User/Desktop/Coding/parsehubjsonfileeg/all.json', encoding='utf8') as file_object:
        # store file data in object
        data = json.load(file_object)

        # dictionary for element which you want to keep
        new_data = {'selection1': []}
        print(new_data)
        # copy item from old data to new data if it has 'reviews'
        for item in data['selection1']:
            if 'reviews' in item:
                new_data['selection1'].append(item)
                print(item['reviews'])
                print('--')

        # save in file
        with open('output.json', 'w') as f:
            json.dump(new_data, f)
selection1 = data['selection1']

for item in selection1:
    name = item['name']
    print('>>>>>>>>>>>>>>>>>>', name)
    CommentID = item['reviews']
    for com in CommentID:
      comment = com['review'].lower()  # converting all to lowercase
      result = re.sub(r'\d+', '', comment)  # remove numbers
      results = (result.translate(
          str.maketrans('', '', string.punctuation))).strip()  # remove punctuations and white spaces
      comments = remove_stopwords(results)
      print('>>>>>>',comments)
      encoded_samples = [[word2id[word] for word in comments]]

      # Padding
      encoded_samples = keras.preprocessing.sequence.pad_sequences(encoded_samples, maxlen=max_words)

      # Make predictions
      label_probs, attentions = model_with_attentions.predict(encoded_samples)
      label_probs = {id2label[_id]: prob for (label, _id), prob in zip(label2id.items(), label_probs[0])}

      # Get word attentions using the attention vector
      print(label_probs)
      print(max(label_probs, key=label_probs.get))  # label with the highest probability

My output is:

>>>>>> ['amazing', 'stay', 'nights', 'cleanliness', 'room', 'faultless']
{'fear': 0.26750156, 'love': 0.0044763167, 'joy': 0.06064613, 'surprise': 0.32365623, 'sadness': 0.03203068, 'anger': 0.31168908}
surprise
>>>>>> ['good', 'time', 'food', 'good']
Traceback (most recent call last):
  File "C:/Users/User/PycharmProjects/Dissertation/loadandresult.py", line 96, in <module>
    encoded_samples = [[word2id[word] for word in comments]]
  File "C:/Users/User/PycharmProjects/Dissertation/loadandresult.py", line 96, in <listcomp>
    encoded_samples = [[word2id[word] for word in comments]]
KeyError: 'everydaythe'

The error occurs because the word 'everydaythe' is not present in my training vocabulary. What should I do to correct this? Please help.

Nedisha

1 Answer


You can add the following condition inside the list comprehension:

encoded_samples = [[word2id[word] for word in comments if word in word2id]]

This will only encode the words in comments that are already present as keys of the dictionary; any unknown word is simply skipped.
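For example, a minimal sketch with a made-up vocabulary (the words and ids below are just for illustration):

word2id = {'good': 0, 'time': 1, 'food': 2}
comments = ['good', 'time', 'food', 'everydaythe']

# 'everydaythe' is silently skipped because it is not a key of word2id
encoded_samples = [[word2id[word] for word in comments if word in word2id]]
print(encoded_samples)  # [[0, 1, 2]]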

Edit:

When you're dealing with dictionaries and trying to access a key that may not exist in every dictionary, you can use get(). This method queries a dictionary for a key and, if the key doesn't exist, returns a default value of your choice, as in the code below:

my_dict = {'id': 0, 'reviews': 4.5}
your_dict = {'id': 1}

# If I just specify the key, the default return value is None
your_dict.get('reviews')

# However, I can specify the default as the second positional argument
# (note: dict.get does not accept a 'default=' keyword argument)
your_dict.get('reviews', 4.0)
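Applied to your loop, a minimal sketch (assuming the same item structure as in your JSON) that skips hotels without reviews could look like this:

for item in selection1:
    name = item['name']
    # get() returns [] when the 'reviews' key is missing, so hotels
    # without reviews are skipped instead of raising a KeyError
    for com in item.get('reviews', []):
        comment = com['review'].lower()
        # ... rest of the preprocessing and prediction as before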
João Amaro
  • No problem, but it's sir, not madam :P – João Amaro Feb 20 '20 at 13:49
  • Thank you very much sir, I have another issue, can you please help me? I am having this error: https://i.stack.imgur.com/QnR55.png. This is because in my json file, this hotel has no reviews. Here is a part of my json file: https://i.stack.imgur.com/rEIjB.png and here is the hotel which has no reviews: https://i.stack.imgur.com/AOr49.png. How do I solve this? – Nedisha Feb 20 '20 at 13:53
  • Sorry sir, but I am really grateful to you; I got a good explanation from you. – Nedisha Feb 20 '20 at 13:55
  • I'll edit my response to provide you with the solution to the error. – João Amaro Feb 20 '20 at 13:56
  • Thank you again sir, I will let you know if it worked. – Nedisha Feb 20 '20 at 13:59
  • Sir, I have another issue about creating dataframe form my output..will you be able to help me? It is on this site: https://stackoverflow.com/questions/60345050/unable-to-create-dataframe-from-output-obtained – Nedisha Feb 21 '20 at 19:29