Using TensorFlow, I am trying to convert a pandas DataFrame to a tf.data.Dataset (tds) so that I can do some NLP work with it. All of the columns hold text data.

>>> df.dtypes
title          object
headline       object
byline         object
dateline       object
text           object
copyright    category
country      category
industry     category
topic        category
file           object
dtype: object

This is labeled data, where df.topic, df.country, and df.industry are labels for df.text. I am trying to build a model that predicts the topic of a text from this labeled dataset, using BERT. Before I get to that, however, I am converting df into a tds.

import pandas as pd
import numpy as np
import tensorflow as tf
from transformers import AutoTokenizer
import tensorflow_hub as hub
import tensorflow_text as text
from sklearn.model_selection import train_test_split

# Load tokenizer and logger
tf.get_logger().setLevel('ERROR')
tokenizer = AutoTokenizer.from_pretrained('roberta-large')

# Load dataframe with just text and topic columns
df = pd.read_csv('test_dataset.csv', sep='|',
                dtype={'topic': 'category', 'country': 'category', 'industry': 'category', 'copyright': 'category'})

# Split dataset into train, test, val (70, 15, 15)
train, test = train_test_split(df, test_size=0.15)
train, val = train_test_split(train, test_size=0.15 / 0.85)  # 15% of the full dataset

# Convert df to tds
train_ds = tf.data.Dataset.from_tensor_slices(dict(train))
val_ds = tf.data.Dataset.from_tensor_slices(dict(val))
test_ds = tf.data.Dataset.from_tensor_slices(dict(test))

for feature_batch, label_batch in train_ds.take(1):
  print('Every feature:', list(feature_batch.keys()))
  print('A batch of topics:', feature_batch['topic'])
  print('A batch of targets:', label_batch )

I am getting ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float). on line 23, in <module> train_ds = tf.data.Dataset.from_tensor_slices(dict(train)).

How can this be the case when this is all text data? How do I fix this?

I have looked at Tensorflow - ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float) but it does not solve my problem.

DrakeMurdoch
  • Are you sure that there are no nulls in your text columns? Null values are read as NaN (a float), which may be why you are getting this error. – Abhilash Rajan Apr 23 '21 at 04:13
  • The problem is that from_tensor_slices needs to convert its input into a Tensor, but the given input might contain variable-length lists inside a NumPy object array, which cannot be converted into tensors (tensors must be rectangular). You can get the same kind of error message by running `a = np.array([[1, 2, 3], [4, 5]], dtype=object)` followed by `print(tf.convert_to_tensor(a))`. To make this work, you need to pad the dataframe's lists so that they are the same length. –  Apr 26 '21 at 11:32
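
The null-value hypothesis from the first comment can be checked directly by looking at which Python types actually occur in each object column. The following is only a minimal sketch on a small made-up dataframe; the sample rows and the helper name python_types_per_column are illustrative assumptions, not part of the original post. A float showing up in a text column points at NaN values.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataframe: 'text' has one missing value.
# dtype=object mirrors the object columns shown in df.dtypes in the question.
df = pd.DataFrame(
    {
        "title": ["a", "b", "c"],
        "text": ["first article", np.nan, "third article"],
    },
    dtype=object,
)


def python_types_per_column(frame):
    """Return {column: set of Python type names} for every object column."""
    return {
        col: {type(value).__name__ for value in frame[col]}
        for col in frame.select_dtypes(include="object").columns
    }


types_found = python_types_per_column(df)  # types present in each column
null_counts = df.isna().sum()              # number of missing values per column

print(types_found)
print(null_counts)
```

If a column reports both str and float, dropping or filling those rows before building the dataset is one way to proceed.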

1 Answer

If you are using tf.data.Dataset.from_tensor_slices, you first have to convert your data into NumPy arrays. Since this is text data, you also need to tokenize it before building the dataset.

# Convert the text and label columns to NumPy arrays.
# 'topic_encoded' is assumed to be an integer-encoded version of 'topic'
# that was added to the dataframe before the train/test/val split.
x_train = train['text'].values
x_test = test['text'].values
x_val = val['text'].values

y_train = train['topic_encoded'].values
y_test = test['topic_encoded'].values
y_val = val['topic_encoded'].values

# Tokenize datasets
tr_tok = tokenizer(list(x_train), return_tensors='tf', truncation=True, padding=True, max_length=128)
val_tok = tokenizer(list(x_val), return_tensors='tf', truncation=True, padding=True, max_length=128)
test_tok = tokenizer(list(x_test), return_tensors='tf', truncation=True, padding=True, max_length=128)

# Convert dfs to tds
train_ds = tf.data.Dataset.from_tensor_slices((dict(tr_tok), y_train))
val_ds = tf.data.Dataset.from_tensor_slices((dict(val_tok), y_val))
test_ds = tf.data.Dataset.from_tensor_slices((dict(test_tok), y_test))

That should solve the problem.
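
The snippet above reads a topic_encoded column that the dataframe in the question does not have. Here is a minimal sketch of one way it could be produced, assuming topic is a pandas category column as in the question: drop rows with a missing text or topic, then use the integer category codes as labels. The sample rows and the names clean and id_to_topic are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the real dataset.
df = pd.DataFrame({
    "text": pd.Series(
        ["rates rise", np.nan, "new phone", "match report", "budget vote"],
        dtype=object,
    ),
    "topic": pd.Categorical(["economy", "economy", "tech", np.nan, "economy"]),
})

# Rows with a missing text or label cannot be tokenized or used for training.
clean = df.dropna(subset=["text", "topic"]).reset_index(drop=True)

# Drop categories that no longer occur, then use the integer codes as labels.
clean["topic"] = clean["topic"].cat.remove_unused_categories()
clean["topic_encoded"] = clean["topic"].cat.codes

# Mapping from integer code back to topic name, useful when reading predictions.
id_to_topic = dict(enumerate(clean["topic"].cat.categories))

print(clean[["text", "topic", "topic_encoded"]])
print(id_to_topic)
```

Doing this before calling train_test_split keeps the codes consistent across the train, validation, and test splits.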

DrakeMurdoch