
For fine-tuning a large language model (Llama 2), what should the format (.txt/.json/.csv) and structure of the training dataset be (e.g. an Excel or Docs file, prompt/response pairs, or instruction/output pairs)? And how should a tabular dataset be prepared or organised for training?

I made a spreadsheet containing around 2000 instruction and output pairs and used the meta-llama/Llama-2-13b-chat-hf model. But when I start querying through the spreadsheet using the above model, it gives wrong answers most of the time and also repeats them many times. So I want to know what kind of document format and structure I should try for fine-tuning Llama 2.
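
For context, a minimal sketch of how I export the spreadsheet to a JSON Lines file is below (the file names my_pairs.csv/my_pairs.jsonl and the column names instruction/output are just placeholders for my sheet):

import pandas as pd

# Spreadsheet saved as CSV; keep only the instruction/output columns.
df = pd.read_csv("my_pairs.csv")[["instruction", "output"]]

# One JSON object per line, e.g. {"instruction": "...", "output": "..."}
df.to_json("my_pairs.jsonl", orient="records", lines=True, force_ascii=False)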

1 Answer


You can try the Hugging Face Datasets library.

To load all JSON files under the directory your_file_dir, try:

from datasets import load_dataset  # the module name is lowercase "datasets"

my_dataset = load_dataset("json", data_files="your_file_dir/*.json")

You can also define the data_files for train/test splits and other options; see https://huggingface.co/docs/datasets/loading#json
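
For example, a minimal sketch with separate train and test files (the file names are placeholders):

from datasets import load_dataset

# Point each split at its own JSON Lines file.
my_dataset = load_dataset(
    "json",
    data_files={"train": "your_file_dir/train.jsonl", "test": "your_file_dir/test.jsonl"},
)

print(my_dataset["train"][0])  # first training example as a Python dict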

  • Thanks for your response. But I want to fine-tune the llama2-chat model on my own dataset. I would like to know what the format and structure of the dataset should be to achieve the best performance, i.e. how the dataset should be prepared. – aiwesee Aug 23 '23 at 03:25
  • I don't think the dataset format has much to do with performance; the structure you mean is just strings stored in a specific form. In fact, the inputs to the Llama model are ```input_ids``` and ```attention_mask```, which can be generated by calling ``` tokenizer("your text here", max_length=256) ``` (see the sketch below); if you are searching for suitable – Xinlong lee Aug 23 '23 at 07:15
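
A minimal sketch of that tokenizer call with the model named in the question (the text passed in is a placeholder and max_length=256 is taken from the comment above):

from transformers import AutoTokenizer

# Requires access to the gated Meta Llama 2 repo on the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-chat-hf")

# Produces the input_ids and attention_mask lists the model consumes.
encoded = tokenizer("your text here", max_length=256, truncation=True)

print(encoded["input_ids"][:10])
print(encoded["attention_mask"][:10])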