Objective
My goal is to fine-tune a pre-trained LLM on a dataset about Manchester United's (MU's) 2021/22 season (they had a poor season). I want to be able to prompt the fine-tuned model with questions such as "How can MU improve?" or "What are MU's biggest weaknesses?". The ideal responses would be insightful, logical, and at least 100 words long.
Data
- I will simply use text from the relevant wiki page as my data: https://en.wikipedia.org/wiki/2021%E2%80%9322_Manchester_United_F.C._season
- How should I structure my data? Should it be a list of dictionaries, each holding one question and its answer (i.e. a list of question-answer pairs), a long string containing all the text data (for context), or a combination of both?
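To make the three options concrete, here is a minimal sketch in plain Python. The names `qa_pairs`, `context_text`, `chunk_text`, and `build_records` are hypothetical, and the context string is a placeholder rather than real article text. Attaching the same passage to every pair is a simplifying assumption; matching each question to its most relevant passage would likely work better.

```python
# Hypothetical sketch of the three data layouts; all names are illustrative.

# Layout 1: a list of question-answer pairs.
qa_pairs = [
    {
        "question": "How could Manchester United improve their consistency "
                    "in the Premier League next season?",
        "answer": "To improve consistency, Manchester United could focus on "
                  "strengthening their squad depth to cope with injuries and "
                  "fatigue throughout the season.",
    },
]

# Layout 2: one long string holding the article text (placeholder here).
context_text = "Placeholder text standing in for the Wikipedia article."


def chunk_text(text, max_words):
    """Split text into consecutive passages of at most max_words words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def build_records(pairs, context):
    """Layout 3: combine both by attaching a context passage to each pair."""
    return [{"context": context, **pair} for pair in pairs]


passages = chunk_text(context_text, max_words=4)
records = build_records(qa_pairs, passages[0])
print(records[0].keys())
```

Chunking matters for the combined layout because a full article is usually longer than the model's input limit, so each record can only carry a passage of it.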
Notes
- I have mainly been experimenting with variants of Google's T5 (e.g. https://huggingface.co/t5-base), which I load through the Hugging Face Transformers library
- So far I have only fine-tuned the model on a list of 30 dictionaries (question-answer pairs), e.g.: {"question": "How could Manchester United improve their consistency in the Premier League next season?", "answer": " To improve consistency, Manchester United could focus on strengthening their squad depth to cope with injuries and fatigue throughout the season. Tactical adjustments could also be explored to deal with teams of different strengths and styles."}
- Fine-tuning on this small dataset (30 question-answer pairs) has given poor results
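For reference, the preprocessing I have in mind looks roughly like the sketch below. It turns each pair into a text-to-text (source, target) string pair and holds out a validation share, so that "poor results" can be measured on unseen questions. The `question:` / `context:` prefixes are an assumption modelled on T5-style question-answering prompts, and the function names, the 20% split, and the placeholder pairs are all hypothetical.

```python
import random


def to_text_to_text(pair, context=""):
    """Turn one question-answer pair into a (source, target) string pair."""
    # Assumed prompt format; the exact prefixes are not prescribed anywhere.
    source = f"question: {pair['question'].strip()}"
    if context:
        source += f" context: {context.strip()}"
    # strip() also removes stray leading spaces in the stored answers.
    target = pair["answer"].strip()
    return source, target


def train_val_split(examples, val_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and hold out a validation share."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Placeholder data standing in for the 30 real question-answer pairs.
pairs = [
    {"question": f"placeholder question {i}", "answer": f"placeholder answer {i}"}
    for i in range(30)
]

examples = [to_text_to_text(p) for p in pairs]
train_examples, val_examples = train_val_split(examples)
print(len(train_examples), len(val_examples))
```

The resulting source and target strings would then be tokenized and passed to the model during fine-tuning.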
Further Questions and Notes
- Other than increasing the size of my dataset, is my approach sound?
- What would you recommend as a minimum number of question-answer pairs to fine-tune the model on?
- I am aware that tuning the hyperparameters could also improve performance, but for now I am more concerned with whether my general approach is sound