Reconstruct news texts from Keras' reuters dataset

Question

I can't seem to make sense of the data provided by Keras' reuters dataset.

The set is loaded like so:

from keras.datasets import reuters
(x_train, y_train), (x_test, y_test) = reuters.load_data()

As far as I understand the "x" arrays are arrays of sequences (lists) of word indices from news stories and the "y" arrays are arrays of the topics of these sequences.

But when I try to translate the word indices of one of the sequences with the provided dictionary into actual words:

wordDict = {y:x for x,y in reuters.get_word_index().items()}  
for index in x_train[0]:
    print (wordDict.get(index))

The sequence seems to make no sense. How do I turn the sequences back into the original news?

Edit: I found a similar thread here. It seems that the indices in the dictionary do not match the word indices in the dataset, but re-downloading the data does not resolve the problem for me.

See cell 6 in https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/3.6-classifying-newswires.ipynb — Alex Ott, Oct 22 '17 at 17:14

score 2 · Accepted Answer · answered Oct 21 '17 at 19:18

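With its default value of 3, the load_data argument "index_from" shifts the indices of actual words so that they are all greater than 3; the lower indices are reserved for special markers such as padding, start of sequence, and unknown word. The texts can therefore be reconstructed with wordDict.get(index - 3).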
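The index - 3 lookup can be sketched as a small helper. This is a minimal illustration, not part of Keras: the function name decode_sequence, the "?" placeholder for reserved indices, and the toy dictionary are all assumptions made for the example, so that it runs without downloading the dataset.

```python
def decode_sequence(sequence, word_index, index_from=3):
    """Map shifted word indices back to words (hypothetical helper)."""
    # Invert the word -> index mapping into index -> word.
    reverse_index = {idx: word for word, idx in word_index.items()}
    # Undo the shift applied by load_data. Reserved indices have no
    # dictionary entry after the shift, so they are shown as "?".
    return " ".join(reverse_index.get(i - index_from, "?") for i in sequence)


# Toy stand-in for reuters.get_word_index(); the values are made up.
toy_word_index = {"the": 1, "of": 2, "to": 3}
# 1 is the start marker, followed by "the of to" shifted by 3.
toy_sequence = [1, 4, 5, 6]

decoded = decode_sequence(toy_sequence, toy_word_index)
print(decoded)  # ? the of to
```

With the real data, the same helper would be called as decode_sequence(x_train[0], reuters.get_word_index()), assuming load_data was called with its default arguments.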

AstronAUT
