How to read a .txt file (corpus) into torchtext in PyTorch?

I only see data.Dataset for the example datasets and data.TabularDataset for CSV, JSON, and TSV files.

https://github.com/pytorch/text#data

https://torchtext.readthedocs.io/en/latest/data.html#dataset

It does work if I read the file in with a TabularDataset, like this: test_file = data.TabularDataset(path=input_filepath, format='csv', fields=[('text', data.Field())])

But my dataset is not tabular, so I wanted to check to see if there was a better option.

pr338

1 Answer

I would suggest writing a quick script that reads your corpus and dumps it to JSON (there are plenty of examples out there), then loading that JSON with torchtext. You will want some structure in your data to get the most out of torchtext (think batches and iterable datasets).
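
As a rough sketch of that conversion step, something like the following could work. It assumes the corpus holds one example per line and writes one JSON object per line (JSON Lines). The function name corpus_to_jsonl, the "text" key, and the file names are placeholders chosen for illustration, not anything torchtext requires.

```python
import json


def corpus_to_jsonl(txt_path, jsonl_path, key="text"):
    """Write one JSON object per non-empty line of a plain-text corpus.

    Assumes one example per line; `key` is a hypothetical field name.
    Returns the number of records written.
    """
    count = 0
    with open(txt_path, encoding="utf-8") as src, \
            open(jsonl_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines in the corpus
            dst.write(json.dumps({key: line}) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical file names; replace with your own paths.
    n = corpus_to_jsonl("corpus.txt", "corpus.jsonl")
    print(f"wrote {n} records")
```

With the legacy torchtext API used in the question, the output should then load with something along the lines of data.TabularDataset(path='corpus.jsonl', format='json', fields={'text': ('text', data.Field())}). For JSON input, fields is a dict keyed by the JSON key, but verify that against the torchtext version you have installed.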

If you are lost on how to iterate through a dataset, check out my other answer here.

KGM