
I want to create a chatbot on my website using ChatGPT. I have some pre-defined question-answer pairs like the ones below:

Question: What is the price of ...?
Answer: $100

Question: How does this help ..?
Answer: 1) Improve... 2) Better... 3) More...

When the customer asks a question related to one of the pre-defined questions, the bot should grab the answer from the pre-defined pair and respond to the customer in natural language.

But I don't know the logic to implement this. There are three roles for chat completion (system, user, assistant).

Do I insert all these pre-defined questions and answers in the system role, like:

[
   'role' => 'system',
   'content' => 'I write all the information here'
],

Or do I write it all in a single user prompt, like:

[
   'role' => 'system',
   'content' => 'You are a helpful assistant'
],
[
   'role' => 'user',
   'content' => 'I write all the information here'
]

Or do I separate it into different user prompts, like:

[
   'role' => 'system',
   'content' => 'You are a helpful assistant'
],
[
   'role' => 'user',
   'content' => 'First pre-defined question and answer...'
],
[
   'role' => 'user',
   'content' => 'Second pre-defined question and answer...'
],
[
   'role' => 'user',
   'content' => 'Third pre-defined question and answer...'
]

Is this the correct way of training a chatbot?

cyz3a5c0v1
  • related questions?: https://stackoverflow.com/q/75729386/11107541 and https://stackoverflow.com/q/75811594/11107541. Also possibly https://stackoverflow.com/q/76612226/11107541 – starball Jul 08 '23 at 03:51
  • Wouldn't it be possible to use any of the models that `gpt4all` uses, to fine tune based on what we want? Or perhaps use `LangChain`? – Nav Jul 29 '23 at 15:40

1 Answer


This is not a particularly good use case for the newer OpenAI GPT models because they do not yet allow fine-tuning. If you put all the information in the prompt, you will probably exceed the token limit of the GPT model quite soon. Even then, there is no guarantee that GPT will heed your prompt rather than answering the user's question from its pre-trained knowledge.

If you still want to use an up-to-date GPT model from OpenAI, you have two alternatives:

  1. You put all the question-answer pairs in a single user message and tell GPT in the system message that this is the context to work with. This is called the "provide reference text" strategy.
  2. You feed each question-answer pair as a separate pair of messages, with the question in a user message and the answer in an assistant message. This is called the "provide examples" tactic, commonly known as few-shot learning.
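The two alternatives above can be sketched in code. This is a minimal illustration of the message layouts in the chat completions format (shown in Python rather than PHP for brevity); the FAQ content is a stand-in for your own data, and the resulting list is what you would pass as `messages` to the API:

```python
# Sample FAQ data (placeholder for your real pre-defined pairs).
faq = [
    ("What is the price of the product?", "$100"),
    ("How does this help?", "1) Improve... 2) Better... 3) More..."),
]

def reference_text_messages(user_question):
    """Alternative 1: all Q&A pairs packed into one user message as context."""
    context = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in faq)
    return [
        {"role": "system",
         "content": "Answer only from the provided Q&A context. "
                    "If the answer is not in the context, say you don't know."},
        {"role": "user",
         "content": f"Context:\n{context}\n\nQuestion: {user_question}"},
    ]

def few_shot_messages(user_question):
    """Alternative 2: each pair becomes a user/assistant example turn."""
    messages = [{"role": "system", "content": "You are a helpful support assistant."}]
    for q, a in faq:
        messages.append({"role": "user", "content": q})
        messages.append({"role": "assistant", "content": a})
    messages.append({"role": "user", "content": user_question})
    return messages
```

Either list is then sent as the `messages` parameter of a chat completion request; only the layout differs.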

The first approach will work to some degree and is more token-efficient than the second. The second approach mirrors the way GPT's chat format was trained, so it might give better results. There is no way to know in advance; you have to try it out.

For your use case, there are probably better methods. Here are three alternatives:

  1. Use a model that allows fine-tuning (from OpenAI, another vendor, or an open-source model) and fine-tune it on your data.
  2. Use an embedding model (again from OpenAI, another vendor, or open source) to compute embedding vectors for your question-answer pairs and store them in a vector database; at query time, embed the user's question and retrieve the closest pair. This is the "use embeddings-based search to implement efficient knowledge retrieval" tactic.
  3. Use the function calling feature of the OpenAI API to let GPT ask for more information, and combine it with the embedding method from (2). This will probably give the best results, but at a higher cost and complexity (2x chat API + 1x embedding API). This is the "use code execution to perform more accurate calculations or call external APIs" tactic.
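The retrieval step in (2) can be sketched as follows. This is a toy illustration only: the `embed` function here is a bag-of-words stand-in, and in production you would call a real embedding API and a vector database instead of a Python list:

```python
import math
from collections import Counter

# Sample FAQ data (placeholder for your real pre-defined pairs).
faq = [
    ("What is the price of the product?", "$100"),
    ("How does this help my business?", "1) Improve... 2) Better... 3) More..."),
]

def embed(text):
    # Toy embedding: lowercase word counts. Replace with a real embedding model.
    return Counter(text.lower().split())

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# "Index" the FAQ once; a vector database would hold these vectors.
index = [(embed(q), q, a) for q, a in faq]

def retrieve(user_question):
    """Return the answer of the most similar pre-defined question."""
    qv = embed(user_question)
    return max(index, key=lambda item: cosine(qv, item[0]))[2]
```

The retrieved answer is then placed into the prompt (e.g. as reference text in the system or user message) so GPT can phrase it naturally for the customer.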

OpenAI has a good tutorial that you may want to read: "How to build an AI that can answer questions about your website".

h2stein