
According to the documentation (https://beta.openai.com/docs/guides/fine-tuning), the training data for fine-tuning an OpenAI GPT-3 model should be structured as follows:

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}

I have a collection of documents from an internal knowledge base that have been preprocessed into a JSONL file in a format like this:

{  "id": 0,  "name": "Article Name",  "description": "Article Description",  "created_at": "timestamp",  "updated_at": "timestamp",  "answer": {    "body_txt": "An internal knowledge base article with body text",  },  "author": {    "name": "First Last"},  "keywords": [],  "url": "A URL to internal knowledge base"}
{  "id": 1,  "name": "Article Name",  "description": "Article Description",  "created_at": "timestamp",  "updated_at": "timestamp",  "answer": {    "body_txt": "An internal knowledge base article with body text",  },  "author": {    "name": "First Last"},  "keywords": [],  "url": "A URL to internal knowledge base"}
{  "id": 2,  "name": "Article Name",  "description": "Article Description",  "created_at": "timestamp",  "updated_at": "timestamp",  "answer": {    "body_txt": "An internal knowledge base article with body text",  },  "author": {    "name": "First Last"},  "keywords": [],  "url": "A URL to internal knowledge base"}

The documentation then suggests that a model could be fine-tuned on these articles using the command openai api fine_tunes.create -t <TRAIN_FILE_ID_OR_PATH> -m <BASE_MODEL>.

Running this results in:

Error: Expected file to have JSONL format with prompt/completion keys. Missing prompt key on line 1. (HTTP status code: 400)

This isn't unexpected, given the documented file structure noted above. Indeed, if I run openai tools fine_tunes.prepare_data -f training-data.jsonl, I am told:

Your file contains 490 prompt-completion pairs
ERROR in necessary_column validator: prompt column/key is missing. Please make sure you name your columns/keys appropriately, then retry

Is this the right approach to fine-tuning a GPT-3 model on a collection of documents, such that questions could later be asked about their content? What would one put in the prompt and completion fields in this case, since I am not starting from a place where I have a collection of possible questions and ideal answers?

Have I fundamentally misunderstood the mechanism used to fine-tune a GPT-3 model? It does make sense to me that GPT-3 would need to be trained on possible questions and answers. However, given that the base models are already trained, and that this process is more about providing additional datasets which aren't in the public domain so that questions can be asked about them, I would have thought what I want to achieve is possible. As a working example, I can indeed go to https://chat.openai.com/ and ask a question about these documents as follows:

Given the following document:

[Paste the text content of one of the documents]

Can you tell me XXX

And indeed it often gets the answer right. What I'm now trying to do is fine-tune the model on ~500 of these documents, such that one doesn't have to paste a whole document each time a question is asked, and such that the model might even be able to consider content across all ~500 documents rather than just the single one the user provided.

David

2 Answers


Fine-tuning is a process of modifying a pre-trained machine learning model to suit the needs of a particular task. It is not done to provide the model with an internal knowledge base. Instead of fine-tuning the model, you can create a database of embeddings for chunks of data from the knowledge base. This database can then be used to semantically search for the information most relevant to a query: when a query is received, search the database for the chunk(s) of data most similar to the query, then feed those chunks to GPT-3 as context to answer from. With this approach you can easily update the knowledge by adding new chunks of data to the database.
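A minimal sketch of that flow, assuming the pre-1.0 openai Python package and numpy; the model names, the most_similar/get_answer helpers, and the prompt template are illustrative choices, not anything prescribed by the API:

import json

import numpy as np
import openai  # assumes OPENAI_API_KEY is set in the environment

EMBED_MODEL = "text-embedding-ada-002"  # assumed embedding model

def embed(text):
    # Returns one embedding vector (a list of floats) for the input text.
    resp = openai.Embedding.create(model=EMBED_MODEL, input=text)
    return np.array(resp["data"][0]["embedding"])

# Index each article body once; in practice you would persist these vectors.
with open("training-data.jsonl") as f:
    docs = [json.loads(line) for line in f]
chunks = [doc["answer"]["body_txt"] for doc in docs]
chunk_vectors = [embed(c) for c in chunks]

def most_similar(query, top_n=3):
    # Cosine similarity between the query vector and every chunk vector.
    q = embed(query)
    scores = [np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)) for v in chunk_vectors]
    best = np.argsort(scores)[::-1][:top_n]
    return [chunks[i] for i in best]

def get_answer(query):
    # Pass the most similar chunks to GPT-3 as context for the question.
    context = "\n\n".join(most_similar(query))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    resp = openai.Completion.create(model="text-davinci-003", prompt=prompt, max_tokens=256)
    return resp["choices"][0]["text"].strip()

With ~500 articles a brute-force similarity scan like this is fine; a dedicated vector store only becomes worthwhile at much larger scales.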

  • Thanks for your answer. I am not sure why it was downvoted, but could you expand upon this? Conceptually what you are saying makes sense; however, looking at the documentation about embeddings https://beta.openai.com/docs/guides/embeddings/what-are-embeddings, the input is some text (which is great) and the output is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. It's not clear to me how this can then be used to prompt one of the models in a way that would consider the documents, based on https://beta.openai.com/docs/api-reference/completions/create – David Jan 30 '23 at 16:00
  • You can take the query, convert it into a vector, then calculate the cosine similarity of this vector with the vectors of every chunk of the knowledge base. You will get a number of most similar chunks that may contain the answer. Then pass these similar chunks to GPT-3 to get the answer from. This YT tutorial from David Shapiro is a very good example: https://www.youtube.com/watch?v=es8e4SEuvV0 – Muneeb Ur Rahman Jan 30 '23 at 18:47
  • You may also be interested in the GPT Index project, which aims to provide a solution to allow the use of knowledge bases larger than the LLM's context size: https://gpt-index.readthedocs.io/en/latest/ The Vector Store index in particular makes use of embeddings. – mickdekkers Feb 08 '23 at 22:28

I would try the process outlined here: https://blog.truefoundry.com/training-fine-tuning-of-llms-with-your-own-data/

They have a similar use case to yours and describe the examples they fed to the LLM. See specifically: Case Study Fine Tuning with Confluence Docs -> Data Preprocessing. In their case, they tried three types of “prompt”/“completion” pairs: random splitting of sentences, regex, and ChatGPT.
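As a rough illustration of the first of those (random splitting), something along these lines would produce a file that passes the prompt/completion validation; the naive sentence split, the output filename, and the split heuristic are arbitrary choices, with the separator and the completion's leading space following the conventions in OpenAI's fine-tuning guide:

import json
import random

# Load the knowledge-base articles (the JSONL format from the question).
with open("training-data.jsonl") as f:
    docs = [json.loads(line) for line in f]

with open("prepared-training-data.jsonl", "w") as out:
    for doc in docs:
        sentences = doc["answer"]["body_txt"].split(". ")
        if len(sentences) < 2:
            continue  # too short to split into a prompt/completion pair
        # Cut each article at a random sentence boundary: the first part
        # becomes the prompt, the remainder the completion.
        cut = random.randint(1, len(sentences) - 1)
        pair = {
            "prompt": ". ".join(sentences[:cut]) + "\n\n###\n\n",
            "completion": " " + ". ".join(sentences[cut:]),
        }
        out.write(json.dumps(pair) + "\n")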

I know it may not be the exact answer you're looking for, but hopefully it can spark some ideas on how you could go from a raw dataset to “prompt”/“completion” pairs.