I'm trying to use a PyTorch Dataset. It works fine on my laptop, but when I run it on the server, I cannot iterate over it. When I tried to print the data inside __getitem__, I got this error:
Traceback (most recent call last):
  File "test.py", line 98, in <module>
    print(fileDataSet[0])
  File "/home/cjunjie/NLP/DocSummarization/dataset.py", line 32, in __getitem__
    print(abstracts)
UnicodeEncodeError: 'latin-1' codec can't encode character '\u22f1' in position 273: ordinal not in range(256)
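As a quick check that does not involve PyTorch, the sketch below tests whether the character from the error message, '\u22f1', can be encoded with latin-1, and prints the encoding that stdout uses on the current machine. The helper name can_encode is made up for illustration, and the check assumes the failure comes from encoding the printed text rather than from the data in the file.

```python
import sys

# The character named in the traceback: U+22F1 (DOWN RIGHT DIAGONAL ELLIPSIS).
char = "\u22f1"


def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding` (hypothetical helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Compare this value between the laptop and the server.
print("stdout encoding:", sys.stdout.encoding)
print("latin-1 ok:", can_encode(char, "latin-1"))
print("utf-8 ok:", can_encode(char, "utf-8"))
```

If the laptop and the server report different stdout encodings, that difference would be worth looking at before the tensor code.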
In the Dataset, I read data from files that contain only integers and commas:
import os

import numpy as np
import torch

def __getitem__(self, index):
    # The first column is used as the length, the remaining columns as the data.
    abstracts = np.genfromtxt(os.path.join(self.abstract_path, self.abstract_list[index]),
                              delimiter=',').astype(int)
    abstracts_length = abstracts[:, 0].flatten()
    print(abstracts)  # It works
    abstracts = torch.LongTensor(abstracts[:, 1:])
    print(abstracts)  # Error: UnicodeEncodeError is raised here

    articles = np.genfromtxt(os.path.join(self.article_path, self.article_list[index]),
                             delimiter=',').astype(int)
    articles_length = articles[:, 0].flatten()
    articles = torch.LongTensor(articles[:, 1:])

    return {'abstracts_data': abstracts, 'abstracts_length': abstracts_length,
            'articles_data': articles, 'articles_length': articles_length}
So it seems that something goes wrong once the data is a LongTensor, since printing the NumPy array works and printing the tensor does not, but I don't know why.
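To separate building the tensor from printing it, a sketch like the following could be used. It assumes the exception is raised while print encodes the text for a latin-1 stdout. to_printable is a hypothetical helper, not part of PyTorch, that escapes any character the target encoding cannot represent instead of raising.

```python
def to_printable(obj, encoding="latin-1"):
    """Return str(obj) with characters that `encoding` cannot represent escaped.

    Hypothetical helper for debugging; `encoding` defaults to the codec
    named in the traceback.
    """
    text = str(obj)
    # backslashreplace turns unencodable characters into \uXXXX escapes.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# A plain string stands in for the tensor's printed form here.
sample = "1 2 \u22f1 3"
print(to_printable(sample))
```

Inside __getitem__, print(to_printable(abstracts)) could replace the failing print. If that succeeds, the tensor was built correctly and only its printed form contains a character the server's stdout cannot encode.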