
I am working on a project to build a question-answering system for a documentation portal containing over 1,000 Markdown documents, with each document consisting of approximately 2,000-4,000 tokens.

I am considering the following two options:

  1. Using indexes and embeddings with GPT-4
  2. Retraining a model like GPT4ALL (or a similar model) to specifically handle my dataset

Which of these approaches is more likely to produce better results for my use case?

Vasil Remeniuk
  • Does this answer your question? [Customize (fine-tune) OpenAI model: How to make sure answers are from customized (fine-tuning) dataset?](https://stackoverflow.com/questions/74000154/customize-fine-tune-openai-model-how-to-make-sure-answers-are-from-customized) – Rok Benko Apr 10 '23 at 11:29
  • 1
    @RokBenko this is the option #1 I mentioned in my question (I’ve been using langchain with indexes powered by FAISS). But I’m not so impressed with the quality of answers, when I use indexes over embeddings. I’ll probably go ahead trying out option #2 with the OSS LLMs (Baize, Vicuna, etc) – Vasil Remeniuk Apr 10 '23 at 12:24

1 Answer


1,000 files with a limited amount of data won't give you good results if you retrain; use embeddings instead. I tried the same thing for my clients and ultimately chose embeddings over fine-tuning a model.
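For reference, the embedding approach boils down to: chunk the documents, embed each chunk once, then at query time embed the question, retrieve the nearest chunks, and stuff them into the prompt. Here is a minimal, stdlib-only sketch of that shape — the hashed bag-of-words `embed` below is a toy stand-in for a real embedding model, and all function names are illustrative, not from any particular library:

```python
import math

DIM = 4096  # size of the toy hashed vocabulary

def chunk_text(text, size=500):
    """Split a document into ~size-word chunks so each fits in a prompt."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Toy embedding: L2-normalised bag-of-words over hashed buckets.
    A real system would call an embedding model here instead."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        vec[hash(word) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def build_index(docs):
    """docs: {filename: text}. Precompute one vector per chunk."""
    return [(name, chunk, embed(chunk))
            for name, text in docs.items()
            for chunk in chunk_text(text)]

def retrieve(index, question, k=3):
    """Return the k chunks whose vectors are closest to the question
    (dot product of unit vectors = cosine similarity)."""
    q = embed(question)
    scored = sorted(index,
                    key=lambda item: sum(a * b for a, b in zip(q, item[2])),
                    reverse=True)
    return scored[:k]

def build_prompt(question, hits):
    """Stuff the retrieved chunks into the prompt sent to the LLM."""
    context = "\n\n".join(chunk for _, chunk, _ in hits)
    return ("Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")
```

In practice you would swap `embed` for a real embedding model and replace the linear scan in `retrieve` with a vector index such as FAISS; the overall flow stays the same.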

LuckyCoder