According to the documentation (https://beta.openai.com/docs/guides/fine-tuning), the training data to fine-tune an OpenAI GPT-3 model should be structured as follows:
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
I have a collection of documents from an internal knowledge base that have been preprocessed into a JSONL file in a format like this:
{ "id": 0, "name": "Article Name", "description": "Article Description", "created_at": "timestamp", "updated_at": "timestamp", "answer": { "body_txt": "An internal knowledge base article with body text" }, "author": { "name": "First Last"}, "keywords": [], "url": "A URL to internal knowledge base"}
{ "id": 1, "name": "Article Name", "description": "Article Description", "created_at": "timestamp", "updated_at": "timestamp", "answer": { "body_txt": "An internal knowledge base article with body text" }, "author": { "name": "First Last"}, "keywords": [], "url": "A URL to internal knowledge base"}
{ "id": 2, "name": "Article Name", "description": "Article Description", "created_at": "timestamp", "updated_at": "timestamp", "answer": { "body_txt": "An internal knowledge base article with body text" }, "author": { "name": "First Last"}, "keywords": [], "url": "A URL to internal knowledge base"}
The documentation then suggests that a model could be fine-tuned on these articles using the command openai api fine_tunes.create -t <TRAIN_FILE_ID_OR_PATH> -m <BASE_MODEL>.
Running this results in:
Error: Expected file to have JSONL format with prompt/completion keys. Missing prompt key on line 1. (HTTP status code: 400)
Which isn't unexpected given the documented file structure noted above. Indeed, if I run openai tools fine_tunes.prepare_data -f training-data.jsonl, then I am told:
Your file contains 490 prompt-completion pairs
ERROR in necessary_column validator: prompt column/key is missing. Please make sure you name your columns/keys appropriately, then retry
Is this the right approach to fine-tuning a GPT-3 model on a collection of documents, such that questions could later be asked about their content? What would one put in the prompt and completion fields in this case, since I am not starting from a collection of possible questions and ideal answers?
Have I fundamentally misunderstood the mechanism used to fine-tune a GPT-3 model? It does make sense to me that GPT-3 would need to be trained on possible questions and answers. However, given that the base models are already trained, and that this process is more about providing additional datasets which aren't in the public domain so that questions can be asked about them, I would have thought what I want to achieve is possible. As a working example, I can indeed go to https://chat.openai.com/ and ask a question about these documents as follows:
Given the following document:
[Paste the text content of one of the documents]
Can you tell me XXX
And indeed it often gets the answer right. What I'm now trying to do is fine-tune the model on ~500 of these documents, such that one doesn't have to paste a whole document each time a question is asked, and such that the model might even be able to consider content across all ~500 documents rather than just the single one the user provided.