
I'm trying to use a PyTorch Dataset. It works fine on my laptop, but when I run it on the server, it cannot iterate over anything. When I tried to print the data in __getitem__, it showed the following:

Traceback (most recent call last):
  File "test.py", line 98, in <module>
    print(fileDataSet[0])
  File "/home/cjunjie/NLP/DocSummarization/dataset.py", line 32, in __getitem__
    print(abstracts)
UnicodeEncodeError: 'latin-1' codec can't encode character '\u22f1' in position 273: ordinal not in range(256)

What I do is read data from a file that contains only integers and commas in a PyTorch Dataset:

def __getitem__(self, index):
    abstracts = np.genfromtxt(os.path.join(self.abstract_path, self.abstract_list[index]),
                              delimiter=',').astype(int)
    abstracts_length = abstracts[:, 0].flatten()

    print(abstracts)  # It works

    abstracts = torch.LongTensor(abstracts[:, 1:])

    # Error
    print(abstracts)

    articles = np.genfromtxt(os.path.join(self.article_path, self.article_list[index]),
                             delimiter=',').astype(int)
    articles_length = articles[:, 0].flatten()
    articles = torch.LongTensor(articles[:, 1:])

    return {'abstracts_data': abstracts, 'abstracts_length': abstracts_length, 'articles_data': articles,
            'articles_length': articles_length}

So it seems that there is something wrong with the LongTensor, but I don't know why it is wrong.

Shiloh_C
  • Take a look at [this question about stdout encoding](https://stackoverflow.com/q/4374455/8718773), it may help you. – Giacomo Alzetta Mar 12 '18 at 10:03
  • another similar question for you.. https://stackoverflow.com/questions/38602628/unicodeencodeerror-latin-1-codec-cant-encode-characters-in-position-0-5-ord – johnashu Mar 12 '18 at 10:04
  • Thank you, but I realized that this is not about the string. – Shiloh_C Mar 12 '18 at 10:31
  • It may be a known issue [here](https://github.com/pytorch/pytorch/issues/1660) – Ubdus Samad Mar 12 '18 at 10:35
  • @UbdusSamad Thank you. It is the same issue but it seems that their solution is to not print it. But whether I print it or not, the dataset doesn't work. – Shiloh_C Mar 12 '18 at 11:47

1 Answer


I don't have the framework available to test this, but my approach would be the following:

print(str(fileDataSet[0], encoding='utf-8', errors='ignore'))

Of course, it might not be UTF-8 either, but you can switch to another encoding here, and ignoring errors will at least let you print the string.

My understanding is that the print function is converting the text to a byte string when it writes it out.
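That conversion can be checked without PyTorch. The following is a minimal sketch, not tested against the asker's setup: it reports the encoding that print uses for stdout and tests whether the character from the traceback (U+22F1) can be encoded. The helper name can_encode is made up for this illustration.

```python
import sys

# The character named in the traceback: U+22F1.
ELLIPSIS = "\u22f1"


def can_encode(text, encoding):
    """Return True if `text` can be converted to bytes with `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Encoding used when print() writes to stdout (may be None if stdout
# has been replaced by an object without an encoding).
print("stdout encoding:", getattr(sys.stdout, "encoding", None))
print("latin-1 can encode it:", can_encode(ELLIPSIS, "latin-1"))
print("utf-8 can encode it:", can_encode(ELLIPSIS, "utf-8"))
```

If the server reports a latin-1 style encoding while the laptop reports UTF-8, that difference would explain why the same print call only fails on the server. Setting the PYTHONIOENCODING environment variable to utf-8 before starting Python is one common way to change the stdout encoding.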

An alternative might be to call bytes.decode(encoding="utf-8", errors="strict") directly (assuming it's a byte string).
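If the object is a regular string (or anything that print would convert with str) rather than a byte string, a similar idea can be applied in the other direction. This is only a sketch under that assumption: the helper printable is hypothetical, and it escapes any character the target encoding cannot represent instead of raising an error.

```python
import sys


def printable(obj, encoding=None):
    """Return str(obj) with characters that `encoding` cannot represent
    replaced by backslash escapes, so that printing it will not raise."""
    # Fall back to the stdout encoding, then to UTF-8 (an assumption).
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    text = str(obj)
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# Stand-in for text that contains the character from the traceback.
sample = "1 2 \u22f1 3"
print(printable(sample, "latin-1"))
```

With this, print(printable(fileDataSet[0])) would show an escaped form of any unsupported character instead of failing, although it does not change the underlying stdout encoding.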

Svavelsyra
  • Thank you, but it doesn't work. I feel like the issue is about the PyTorch. I have some misunderstanding and the question description was not accurate. – Shiloh_C Mar 12 '18 at 10:26