I have a list of dicts as follows:

[{'text': ['The', 'Fulton', 'County', 'Grand', ...], 'tags': ['AT', 'NP-TL', 'NN-TL', 'JJ-TL', ...]},
 {'text': ['The', 'jury', 'further', 'said', ...], 'tags': ['AT', 'NN', 'RBR', 'VBD', ...]},
 ...]

Each dict holds one sentence: 'text' is its word list and 'tags' the corresponding POS tags. This comes directly from the Brown corpus of the NLTK dataset, loaded using:

from nltk.corpus import brown
data = brown.tagged_sents()
data = {'text': [[word for word, tag in sent] for sent in data], 'tags': [[tag for word, tag in sent] for sent in data]}

import pandas as pd
df = pd.DataFrame(data, columns=["text", "tags"])

from sklearn.model_selection import train_test_split
train, val = train_test_split(df, test_size=0.2)
train.to_json("train.json", orient='records')
val.to_json("val.json", orient='records')
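One detail worth double-checking before loading: the legacy TabularDataset json reader parses the file one JSON object per line (JSON Lines), while to_json(orient='records') alone writes a single JSON array, so passing lines=True as well keeps the two compatible. A minimal stdlib sketch of the expected layout, with toy records standing in for the Brown data:

```python
import json

# Toy records standing in for the Brown sentences (same keys as above)
records = [
    {'text': ['The', 'jury'], 'tags': ['AT', 'NN']},
    {'text': ['The', 'Fulton', 'County'], 'tags': ['AT', 'NP-TL', 'NN-TL']},
]

# JSON Lines layout: one JSON object per line
payload = '\n'.join(json.dumps(rec) for rec in records)

# Each line parses back to a single dict
parsed = [json.loads(line) for line in payload.splitlines()]
print(parsed[0]['tags'])  # ['AT', 'NN']
```

With pandas, `train.to_json("train.json", orient='records', lines=True)` produces exactly this layout.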

I want to load this JSON into a torchtext.data.TabularDataset using:

from torchtext import data

TEXT = data.Field(lower=True)
TAGS = data.Field(unk_token=None)

data_fields = [('text', TEXT), ('tags', TAGS)]
train, val = data.TabularDataset.splits(path='./', train='train.json', validation='val.json', format='json', fields=data_fields)

But it gives me this error:

/usr/local/lib/python3.6/dist-packages/torchtext/data/example.py in fromdict(cls, data, fields)
     17     def fromdict(cls, data, fields):
     18         ex = cls()
---> 19         for key, vals in fields.items():
     20             if key not in data:
     21                 raise ValueError("Specified key {} was not found in "

AttributeError: 'list' object has no attribute 'items'

Note that I don't want TabularDataset to tokenize the sentences for me, since they are already tokenized by NLTK. How do I approach this? (I cannot switch corpora to something I can load directly from torchtext.datasets; I have to use the Brown Corpus.)

Rwitaban Goswami
1 Answer

For those looking at this question now, note that it uses the legacy version of torchtext. You can still use this functionality, but you need to import it from torchtext.legacy, e.g.:

from torchtext import data
from torchtext import datasets
from torchtext import legacy

TEXT = legacy.data.Field()
TAGS = legacy.data.Field()

Because the input is JSON, fields must be a dict mapping each JSON key to a (name, Field) tuple, and the key must match your JSON, i.e. 'tags':

fields = {'text': ('text', TEXT), 'tags': ('tags', TAGS)}
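A minimal sketch of why the dict form matters: legacy torchtext's Example.fromdict (visible in the traceback above) iterates fields.items(), so a plain list of tuples triggers exactly the reported AttributeError. The snippet below is a simplified mimic of that behavior, not the actual torchtext source:

```python
def fromdict_like(data, fields):
    # Simplified stand-in for legacy Example.fromdict: it requires a dict
    # because it calls fields.items() (see the traceback in the question)
    out = {}
    for key, (name, field) in fields.items():
        if key not in data:
            raise ValueError("Specified key {} was not found in the input data".format(key))
        out[name] = data[key]
    return out

record = {'text': ['The', 'jury'], 'tags': ['AT', 'NN']}

# List-style fields (as in the question) fail before any key is read:
try:
    fromdict_like(record, [('text', ('text', None)), ('tags', ('tags', None))])
except AttributeError as e:
    print(e)  # 'list' object has no attribute 'items'

# Dict-style fields work:
fields = {'text': ('text', None), 'tags': ('tags', None)}
print(fromdict_like(record, fields))  # {'text': ['The', 'jury'], 'tags': ['AT', 'NN']}
```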

That should do the trick. For anyone using the latest torchtext, you can create an iterable dataset with the _RawTextIterableDataset helper. Here is an example that loads from a JSON file:

import json

def _create_data_from_json(data_path):
    # Yield one (tags, text) tuple per record in the JSON file
    with open(data_path) as json_file:
        raw_json_data = json.load(json_file)
        for item in raw_json_data:
            _tags, _text = item['tags'], item['text']
            yield (_tags, _text)


#Load torchtext utilities needed to convert (label, paragraph) tuple into iterable dataset               
from torchtext.data.datasets_utils import (
    _RawTextIterableDataset,
    _wrap_split_argument,
    _add_docstring_header,
    _create_dataset_directory,
)

#Dictionary of data sources. The train and test data JSON files have items consisting of paragraphs and labels
DATA_SOURCE = {
    'train': 'data/train_data.json',
    'test': 'data/test_data.json'
}

#This is the number of lines/items in each data set
NUM_LINES = {
    'train': 200,
    'test': 100,
}

#Naming the dataset
DATASET_NAME = "BAR"

#This function returns the iterable dataset based on whatever split is passed in
@_add_docstring_header(num_lines=NUM_LINES, num_classes=2)
@_create_dataset_directory(dataset_name=DATASET_NAME)
@_wrap_split_argument(('train', 'test'))
def FOO(root, split):
    return _RawTextIterableDataset(DATASET_NAME, NUM_LINES[split],
                                   _create_data_from_json(DATA_SOURCE[split]))

You can then call this function to return your iterable dataset:

#Get iterable for train and test data sets
train_iter, test_iter = FOO(split=('train', 'test'))

The _create_data_from_json function can be replaced with any function that yields a tuple from a data source.
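As a sketch of that last point, any generator with the same yield contract works, e.g. one that iterates an in-memory list instead of a file (toy data; _create_data_from_list is a hypothetical name):

```python
def _create_data_from_list(items):
    # Same yield contract as _create_data_from_json, but over an in-memory list
    for item in items:
        yield (item['tags'], item['text'])

items = [
    {'text': ['The', 'jury'], 'tags': ['AT', 'NN']},
    {'text': ['He', 'ran'], 'tags': ['PPS', 'VBD']},
]

pairs = list(_create_data_from_list(items))
print(pairs[0])  # (['AT', 'NN'], ['The', 'jury'])
```

Passing such a generator to _RawTextIterableDataset in place of _create_data_from_json(...) yields the same kind of iterable dataset.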

KGM