
I am working on a project to build a question-answering system for a documentation portal containing over 1,000 Markdown documents, with each document consisting of approximately 2,000-4,000 tokens.

I am considering the following two options:

  1. Using indexes and embeddings with GPT-4
  2. Retraining a model like GPT4ALL (or a similar model) to specifically handle my dataset

Which of these approaches is more likely to produce better results for my use case?

Vasil Remeniuk
  • Does this answer your question? [Customize (fine-tune) OpenAI model: How to make sure answers are from customized (fine-tuning) dataset?](https://stackoverflow.com/questions/74000154/customize-fine-tune-openai-model-how-to-make-sure-answers-are-from-customized) – Rok Benko Apr 10 '23 at 11:29
  • 1
    @RokBenko this is the option #1 I mentioned in my question (I’ve been using langchain with indexes powered by FAISS). But I’m not so impressed with the quality of answers, when I use indexes over embeddings. I’ll probably go ahead trying out option #2 with the OSS LLMs (Baize, Vicuna, etc) – Vasil Remeniuk Apr 10 '23 at 12:24

1 Answer


1,000 files with a limited amount of data won't give you good results if you retrain; use embeddings instead. I tried the same thing for my clients and ultimately chose embeddings over fine-tuning a model.
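For reference, the embedding approach boils down to: chunk the documents, embed each chunk once, then at query time embed the question, retrieve the nearest chunks, and stuff them into the prompt. Here is a minimal, stdlib-only sketch of that shape — the hashed bag-of-words `embed` below is a toy stand-in for a real embedding model, and all function names are illustrative, not from any particular library:

```python
import math

DIM = 4096  # size of the toy hashed vocabulary

def chunk_text(text, size=500):
    """Split a document into ~size-word chunks so each fits in a prompt."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Toy embedding: L2-normalised bag-of-words over hashed buckets.
    A real system would call an embedding model here instead."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        vec[hash(word) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_index(docs):
    """docs: {filename: text}. Precompute one vector per chunk."""
    return [(name, chunk, embed(chunk))
            for name, text in docs.items()
            for chunk in chunk_text(text)]

def retrieve(index, question, k=3):
    """Return the k chunks whose vectors are closest to the question
    (dot product of unit vectors = cosine similarity)."""
    q = embed(question)
    scored = sorted(index,
                    key=lambda item: sum(a * b for a, b in zip(q, item[2])),
                    reverse=True)
    return scored[:k]

def build_prompt(question, hits):
    """Stuff the retrieved chunks into the prompt sent to the LLM."""
    context = "\n\n".join(chunk for _, chunk, _ in hits)
    return ("Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")
```

In practice you would swap `embed` for a real embedding model and replace the linear scan in `retrieve` with a vector index such as FAISS; the overall flow stays the same.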

LuckyCoder