Using TensorFlow, I am trying to convert a pandas DataFrame to a tf.data.Dataset (tds) so that I can do some NLP work with it. It is all text data.
>>> df.dtypes
title          object
headline       object
byline         object
dateline       object
text           object
copyright    category
country      category
industry     category
topic        category
file           object
dtype: object
This is labeled data, where df.topic, df.country, and df.industry are labels for df.text. I am trying to build a model to predict the topic of the text, given this labeled dataset and using BERT. Before I get to that, however, I am converting df into a tds.
import pandas as pd
import numpy as np
import tensorflow as tf
from transformers import AutoTokenizer
import tensorflow_hub as hub
import tensorflow_text as text
from sklearn.model_selection import train_test_split
# Load tokenizer and logger
tf.get_logger().setLevel('ERROR')
tokenizer = AutoTokenizer.from_pretrained('roberta-large')
# Load dataframe with just text and topic columns
df = pd.read_csv('test_dataset.csv', sep='|',
                 dtype={'topic': 'category', 'country': 'category',
                        'industry': 'category', 'copyright': 'category'})
# Split dataset into train, val, test (roughly 72/13/15: 15% held out for test, then 15% of the remainder for val)
train, test = train_test_split(df, test_size=0.15)
train, val = train_test_split(train, test_size=0.15)
# Convert df to tds
train_ds = tf.data.Dataset.from_tensor_slices(dict(train))
val_ds = tf.data.Dataset.from_tensor_slices(dict(val))
test_ds = tf.data.Dataset.from_tensor_slices(dict(test))
for feature_batch, label_batch in train_ds.take(1):
    print('Every feature:', list(feature_batch.keys()))
    print('A batch of topics:', feature_batch['topic'])
    print('A batch of targets:', label_batch)
I am getting the following error on line 23, in <module>, at train_ds = tf.data.Dataset.from_tensor_slices(dict(train)):
ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float).
How can this be the case when this is all text data? How do I fix this?
I have looked at "Tensorflow - ValueError: Failed to convert a NumPy array to a Tensor (Unsupported object type float)" but it does not solve my problem.
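One possible explanation, offered as a guess: if the CSV has empty fields, pandas reads them as NaN, which is a Python float. An object (or category) column can then hold a mix of str and float values, and that may be what the "Unsupported object type float" message is pointing at. The sketch below checks for this on a small made-up frame and converts every column to a plain string array before building the dataset. The toy data and the helper names python_types_per_column and to_string_arrays are hypothetical, and replacing missing values with an empty string is only one choice (dropping those rows is another).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real frame: one missing headline, one missing topic.
df = pd.DataFrame({
    "text": ["first story", "second story", "third story"],
    "headline": ["A", np.nan, "C"],
    "topic": pd.Categorical(["sports", None, "markets"]),
})


def python_types_per_column(frame):
    """Map each column name to the set of Python type names it actually holds."""
    return {col: {type(v).__name__ for v in frame[col]} for col in frame.columns}


def to_string_arrays(frame, fill=""):
    """Return {column: NumPy unicode array}, replacing missing values with `fill`."""
    return {
        col: np.array([fill if pd.isna(v) else str(v) for v in frame[col]], dtype=str)
        for col in frame.columns
    }


# A column that reports both 'str' and 'float' contains NaN next to the text.
print(python_types_per_column(df))

features = to_string_arrays(df)

# The cleaning above needs only pandas and NumPy; the dataset step needs TensorFlow.
try:
    import tensorflow as tf
except ImportError:
    tf = None

if tf is not None:
    ds = tf.data.Dataset.from_tensor_slices(features)
    print(ds.element_spec)
```

If the check on the real df shows only str in every column, missing values are not the cause and this workaround does not apply.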