
I have a list of text comments and a list of their labels. I want to fine-tune an LLM; for this I need to create a tensor dataset. Below is the code I am using.

# List of all labels
labels_list = [label_dictionary[category] for category in training_data['Area']]
print('This is label list')
print(labels_list)

# Text list
text_list = [y for y in training_data['Pain Points']]
print('This is text list')
print(text_list)

Output:

This is label list [0, 1, 0, 2, 3, 1, 2, 1]  

This is text list  

['oblems Mentioned in Text:',  'App not working properly, getting errors as soon as contest is completed and unable to change players.',  'Poor user interface and experience making it difficult to navigate and use.',  'Buying reviews for the app.',  'Server glitches taking 2-4 hours to fix problems mentioned above.',  'Unable to complete process due to persistent errors on server being unable to handle huge amount of requests for editing squad information .',  'When trying to login, OTP not received after entering mobile number .',  "Takes years to load score/leaderboard results; 1 hour delayed updates on leaderboards/scores . 8 No information about position when joining a 10 rupees contest; auto joins more expensive contests than desired with no way of changing back 9 Bot opponents don't play, wasting money 10 Crashing often 11 Very poor performance from servers 12 Too many bugs 13 Copied concept 14 Confusing UI 15 Claims rewards 16 Crashed just before first IPL match 17 Andhra Pradesh & Telangana states are unavailable 18 Unable receive OTP 19 Second thing verified PAN status still unverified"]

Now I am tokenizing it:

from transformers import AutoTokenizer

model_name = "tiiuae/falcon-40b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)

tokenized_text = tokenizer(text_list, truncation=True)

and using TensorFlow I am creating a dataset:

import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices((dict(tokenized_text),labels_list))

Output:

TypeError: Could not build aTypeSpec` for [[1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]] with type list 

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/constant_op.py in convert_to_eager_tensor(value, ctx, dtype)
101       dtype = dtypes.as_dtype(dtype).as_datatype_enum
102   ctx.ensure_initialized()
--> 103   return ops.EagerTensor(value, ctx.device_name, dtype)
104
105

ValueError: Can't convert non-rectangular Python sequence to Tensor.

1 Answer


Try using Ragged Tensors, but I don't have a lot of experience when it comes to training with them. The more complex/flexible approach is to define your own dataset with dataset.from_generator, where you define how to produce one element of your dataset at a time. Sketches of both approaches follow.
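A minimal sketch of the ragged-tensor route, assuming tokenized_text is the dict-like output of the Hugging Face tokenizer and labels_list is the list of integer labels from your question (the small example values below are made up just to make the snippet runnable):

import tensorflow as tf

# Illustrative stand-ins for the question's variables: the tokenizer returns
# variable-length lists of input_ids / attention_mask, and labels_list holds
# one integer label per comment.
tokenized_text = {
    "input_ids": [[101, 2023, 2003, 102], [101, 2178, 7615, 2007, 102]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1, 1, 1]],
}
labels_list = [0, 1]

# Wrap each variable-length feature in a tf.RaggedTensor so that
# from_tensor_slices no longer requires rectangular input.
features = {k: tf.ragged.constant(v) for k, v in tokenized_text.items()}

dataset = tf.data.Dataset.from_tensor_slices((features, labels_list))

for example, label in dataset.take(1):
    print(example["input_ids"], label)

Depending on the model, you may still need to convert ragged batches to dense (padded) tensors before feeding them to Keras layers.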

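And a sketch of the from_generator approach, reusing the same tokenized_text and labels_list as above; the generator yields one example at a time, and the output_signature shapes (assumed here, adapt them to your own features) tell TensorFlow that each sequence has an unknown length:

import tensorflow as tf

def generate_examples():
    # Yield one (features, label) pair at a time; each sequence keeps
    # its own length, so no rectangular structure is needed up front.
    for ids, mask, label in zip(
        tokenized_text["input_ids"], tokenized_text["attention_mask"], labels_list
    ):
        yield {"input_ids": ids, "attention_mask": mask}, label

dataset = tf.data.Dataset.from_generator(
    generate_examples,
    output_signature=(
        {
            "input_ids": tf.TensorSpec(shape=(None,), dtype=tf.int32),
            "attention_mask": tf.TensorSpec(shape=(None,), dtype=tf.int32),
        },
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
)

# Pad each batch to the length of its longest sequence so the model
# receives rectangular tensors.
dataset = dataset.padded_batch(8)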